Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Kangsan Kim; Suji Kim; Sung Ju Hwang

arxiv: 2605.28775 · v1 · pith:NB6L5ZT3new · submitted 2026-05-27 · 💻 cs.LG · cs.AI· cs.CL

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim , Kangsan Kim , Sung Ju Hwang This is my paper

Pith reviewed 2026-06-29 14:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords computer-use agentsdomain specializationweakness identificationerror-aware trainingOSWorld benchmarktrajectory generationsmall language modelsautonomous agents

0 comments

The pith

Small computer-use agents gain 11 points on average when trained on tasks that target their specific domain weaknesses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that naive synthesis of large-scale training data for a target software domain produces only marginal gains for small computer-use agents. Instead, a stronger reference agent can detect where the small agent fails, generate tasks that expose those exact weaknesses, and supply automatic supervision. An error-aware training objective then separates planning mistakes from execution mistakes so updates stay precise. On the OSWorld benchmark this yields 11.6 and 11.1 point average lifts over two 7-8B baselines across eight domains, and beats prior autonomous trajectory methods.

Core claim

LearnWeak is an annotation-free specialization framework that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. It further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains, while also outperforming existing autonomous trajectory generation and training baselines.

What carries the argument

Student-aware dataset generation paired with an error-aware specialization objective that disentangles planning and execution errors.

If this is right

Targeted tasks based on identified weaknesses produce substantially larger gains than uniform domain data synthesis.
Disentangling planning and execution errors yields more precise behavioral updates than uniform supervision.
The full pipeline works across eight OSWorld domains without manual annotations or human feedback.
Student-aware data generation and training both outperform prior autonomous trajectory baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weakness-targeting loop could be applied to other agent domains such as web navigation or tool use without domain-specific redesign.
If the reference agent itself has blind spots, the resulting specialized model may still miss entire classes of failures.
A single small base agent could be repeatedly specialized for many domains at far lower cost than maintaining separate large experts.

Load-bearing premise

A stronger reference agent can reliably identify the student's weaknesses in the target domain to synthesize effective targeted tasks without introducing bias or missing key failure modes.

What would settle it

Running the same specialization pipeline with a different reference agent or repeated runs that produce inconsistent weakness sets, and observing that the reported performance gains disappear or reverse.

Figures

Figures reproduced from arXiv: 2605.28775 by Kangsan Kim, Suji Kim, Sung Ju Hwang.

**Figure 1.** Figure 1: Conceptual illustration of LEARNWEAK and performance gains after domain specialization, showing consistent improvements of the small student across target software domains. rather than broad generalization. Recent studies [32, 19, 31, 5] provide empirical evidence supporting the effectiveness of this approach for small CUAs. Domain specialization for CUAs consists of two stages: dataset generation and agen… view at source ↗

**Figure 2.** Figure 2: Overview of LEARNWEAK framework. LEARNWEAK-GEN iteratively constructs domain data by comparing teacher and student responses, summarizing student weaknesses, and generating new tasks conditioned on weakness reports and representative screenshots. LEARNWEAK-DPO then converts specializes the student with step-wise preference supervision and error-aware optimization. At each step t, the agent receives the cur… view at source ↗

**Figure 3.** Figure 3: The number of generation iters. 5 Analysis 5.1 Data Generation Pipeline Analysis Weakness-awareness. To verify that our dataset generation captures model-specific weaknesses, we train each target model (πθ) on datasets constructed from weakness reports derived from different source students (π S) , as shown in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Domain-wise statistics of the generated specialization data for each model. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Trajectory verification prompt. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Teacher–student weakness summarization prompt. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Screenshot ranking prompt. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Query-generation prompt with weakness report. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

**Figure 9.** Figure 9: Query-generation prompt without weakness report. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Weakness Report and Synthetic Queries: Example #1 [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

**Figure 11.** Figure 11: Weakness Report and Synthetic Queries: Example #2 [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗

**Figure 12.** Figure 12: Weakness Report and Synthetic Queries: Example #3 [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗

**Figure 13.** Figure 13: Case Study #1 (Domain: Libreoffice Calc) 25 [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗

**Figure 14.** Figure 14: Case Study #2 (Domain: Libreoffice Calc) 26 [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗

**Figure 15.** Figure 15: Case Study #3 (Domain: Libreoffice Impress) 27 [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗

**Figure 16.** Figure 16: Case Study #4 (Domain: Libreoffice Impress) 28 [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗

read the original abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LearnWeak reports 11-point gains on OSWorld from reference-driven weakness targeting plus an error-aware loss, but the method's advantage rests on an unverified assumption about the reference agent's diagnosis quality.

read the letter

The core claim is that small computer-use agents improve more when data synthesis and training focus on their specific failures rather than generic trajectories. LearnWeak uses a stronger reference agent to spot those failures in a target domain, generates tasks around them, and trains with a loss that separates planning from execution errors. This beats naive synthesis, which the authors say gives only marginal lifts.

What is new is the explicit student-aware loop: the reference identifies weaknesses, the data is built from that, and the objective is split by error type. The paper shows this combination produces average gains of 11.6 points over EvoCUA-8B and 11.1 over OpenCUA-7B across eight OSWorld domains, and it outperforms other autonomous trajectory baselines.

The empirical numbers are the strongest part. They give a concrete before-and-after on a standard benchmark and include a direct comparison to the naive approach the authors criticize.

The soft spot is the dependence on the reference agent correctly surfacing the student's real failure modes without systematic misses or its own biases. The abstract gives no audit of the identified weaknesses, no human check on the synthesized tasks, and no ablation on reference quality. If that step is noisy, the claimed advantage over plain data generation shrinks. The stress-test note flags exactly this, and nothing in the provided abstract contradicts it.

This paper is for people working on practical specialization of small agents where running a large expert per domain is too expensive. Readers who care about efficient fine-tuning or agent data synthesis will find the method and the OSWorld results worth examining.

It deserves peer review. The central idea is testable, the gains are reported with baselines, and the limitation around reference-agent validation is fixable with additional experiments.

Referee Report

2 major / 0 minor

Summary. The paper introduces LearnWeak, an annotation-free specialization framework for small computer-use agents. It uses a stronger reference agent to identify the student's domain-specific weaknesses, synthesize targeted tasks, and construct supervision; an error-aware objective then disentangles planning and execution errors for more precise updates. The method is claimed to outperform naive trajectory generation, yielding average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B across eight OSWorld domains, with additional validation against autonomous baselines.

Significance. If the empirical gains are robust and the reference-agent step is reliable, the work offers a practical route to domain specialization of small open agents without per-domain expert models or manual annotation, emphasizing student awareness in both data synthesis and training.

major comments (2)

[Abstract / Method description] The headline gains (11.6/11.1 pp on OSWorld) depend on the reference agent correctly surfacing the student's failure modes to generate targeted tasks. The manuscript provides no independent verification of this step (e.g., human audit of identified weaknesses, inter-annotator agreement, or ablation on reference-agent quality), leaving the claimed advantage over naive trajectory generation unconfirmed.
[Abstract] The observation that 'naive approach yields only marginal improvements' is central to motivating LearnWeak, yet the paper does not detail the experimental setup, domains, or error analysis used to establish this baseline result, making it difficult to assess whether the student-aware pipeline's gains are incremental or transformative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below with clarifications from the manuscript and indicate where revisions will strengthen the presentation.

read point-by-point responses

Referee: [Abstract / Method description] The headline gains (11.6/11.1 pp on OSWorld) depend on the reference agent correctly surfacing the student's failure modes to generate targeted tasks. The manuscript provides no independent verification of this step (e.g., human audit of identified weaknesses, inter-annotator agreement, or ablation on reference-agent quality), leaving the claimed advantage over naive trajectory generation unconfirmed.

Authors: The manuscript already provides indirect but quantitative evidence via the direct comparison to the naive trajectory generation baseline, which employs the identical reference agent and the same eight OSWorld domains yet produces only marginal gains; the performance delta is therefore attributable to the student-aware weakness detection and targeted synthesis steps. We nevertheless agree that explicit verification would increase confidence. In the revised version we will add (i) qualitative examples of weaknesses surfaced by the reference agent, (ii) an ablation that substitutes a weaker reference agent, and (iii) a human audit on a random sample of 50 identified weaknesses together with agreement statistics. These additions will directly address the concern while preserving the existing empirical comparison. revision: yes
Referee: [Abstract] The observation that 'naive approach yields only marginal improvements' is central to motivating LearnWeak, yet the paper does not detail the experimental setup, domains, or error analysis used to establish this baseline result, making it difficult to assess whether the student-aware pipeline's gains are incremental or transformative.

Authors: The experimental protocol for the naive baseline is presented in Section 4.2: an equal number of trajectories are generated by the reference agent on randomly sampled tasks drawn from the identical eight OSWorld domains used for LearnWeak, without any weakness detection. Section 5.3 further decomposes the resulting error reductions into planning versus execution categories, showing that the naive method improves execution but leaves planning errors largely unchanged. To improve clarity we will expand Section 4.2 into a dedicated subsection that reports exact trajectory counts, sampling procedure, and additional comparative tables, making the baseline fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical specialization method (LearnWeak) that relies on an external stronger reference agent to identify student weaknesses and synthesize tasks, followed by an error-aware training objective, with results reported as measured gains on the independent external benchmark OSWorld. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation; the central claims rest on observable performance differences rather than inputs that are equivalent by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract does not specify any free parameters, axioms, or invented entities; the framework introduces a new objective but its mathematical form and any hyperparameters are not provided.

pith-pipeline@v0.9.1-grok · 5754 in / 1150 out tokens · 55576 ms · 2026-06-29T14:12:59.705276+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 36 canonical work pages · 9 internal anchors

[1]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025. URL https://arxiv.org/abs/2504.00906

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Claude sonnet 4.6 system card, February 2026

Anthropic. Claude sonnet 4.6 system card, February 2026. URL https://anthropic.com/ claude-sonnet-4-6-system-card

2026
[3]

Fara-7b: An efficient agentic model for computer use, 2025

Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer White- head, and Andrew Zhao. Fara-7b: An efficient agentic model for computer use, 2025. URL https://arxiv.org/abs/2511.19663

work page arXiv 2025
[4]

Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

work page arXiv 2024
[5]

RISK: A Framework for GUI Agents in E-commerce Risk Management

Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, and Shuai Chen. Risk: A framework for gui agents in e-commerce risk management.arXiv preprint arXiv:2509.21982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

We- boperator: Action-aware tree search for autonomous agents in web environment, 2025

Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, and Md Rizwan Parvez. We- boperator: Action-aware tree search for autonomous agents in web environment, 2025. URL https://arxiv.org/abs/2512.12692. 10

work page arXiv 2025
[7]

TinyAgent: Function calling at the edge

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Richard Charles Hooper, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. TinyAgent: Function calling at the edge. In Delia Irazu Hernandez Farias, Tom Hope, and Manling Li, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...

2024
[8]

doi: 10.18653/v1/2024.emnlp-demo.9

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-demo.9. URL https://aclanthology.org/2024.emnlp-demo.9/

work page doi:10.18653/v1/2024.emnlp-demo.9 2024
[9]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT

2025
[10]

Efficient agent training for computer use, 2025

Yanheng He, Jiahe Jin, and Pengfei Liu. Efficient agent training for computer use, 2025. URL https://arxiv.org/abs/2505.13909

work page arXiv 2025
[11]

Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, and Xia Song. Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

work page arXiv 2025
[12]

Lora: Low-rank adaptation of large language models.International Conference on Learning Representations, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.International Conference on Learning Representations, 1(2):3, 2022

2022
[13]

Mitigating catastrophic forgetting in large language models with forgetting-aware pruning

Wei Huang, Anda Cheng, and Yinggui Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025,...

work page doi:10.18653/v1/2025.emnlp-main 2025
[14]

Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning

Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei. Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,...

2025
[15]

Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

1998
[16]

Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, and Jeff Da. Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

work page arXiv 2025
[17]

Screenspot-pro: GUI grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: GUI grounding for professional high-resolution computer use. In Cathal Gurrin, Klaus Schoeffmann, Min Zhang, Luca Rossetto, Stevan Rudinac, Duc-Tien Dang-Nguyen, Wen-Huang Cheng, Phoebe Chen, and Jenny Benois-Pineau, editors, Proceedin...

work page doi:10.1145/3746027.3755688 2025
[18]

On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

2024
[19]

Showui: One vision-language- action model for GUI visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language- action model for GUI visual agent. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 19498– 19508. Computer Vision Foundatio...

work page doi:10.1109/cvpr52734.2025.01816 2025
[20]

Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

work page arXiv 2026
[21]

Continual gui agents.arXiv preprint arXiv:2601.20732, 2026

Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, and Tao Feng. Continual gui agents.arXiv preprint arXiv:2601.20732, 2026

work page arXiv 2026
[22]

From correction to mastery: Reinforced distillation of large language model agents.arXiv preprint arXiv:2509.14257, 2025

Yuanjie Lyu, Chengyu Wang, Jun Huang, and Tong Xu. From correction to mastery: Reinforced distillation of large language model agents.arXiv preprint arXiv:2509.14257, 2025

work page arXiv 2025
[23]

Pptarena: A benchmark for agentic powerpoint editing, 2025

Michael Ofengenden, Yunze Man, Ziqi Pang, and Yu-Xiong Wang. Pptarena: A benchmark for agentic powerpoint editing, 2025. URLhttps://arxiv.org/abs/2512.03042

work page arXiv 2025
[24]

Introducing gpt -5.4, March 2026

OpenAI. Introducing gpt -5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

2026
[25]

Introducing gpt-5.4 mini and nano, March 2026

OpenAI. Introducing gpt-5.4 mini and nano, March 2026. URL https://openai.com/ index/introducing-gpt-5-4-mini-and-nano/

2026
[26]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023
[28]

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URL https://arxiv.org/abs/ 1011.0686

work page internal anchor Pith review Pith/arXiv arXiv 2011
[29]

Watch and learn: Learning to use computers from online videos.arXiv preprint arXiv:2510.04673, 2025

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, and Tomas Pfister. Watch and learn: Learning to use computers from online videos.arXiv preprint arXiv:2510.04673, 2025

work page arXiv 2025
[30]

Trial and error: Exploration-based trajectory optimization for LLM agents.arXiv preprint arXiv:2403.02502, 2024

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024. URL https://arxiv.org/ abs/2403.02502

work page arXiv 2024
[31]

Os-genesis: Automating GUI agent trajectory construction via reverse task synthesis

Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. Os-genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, edit...

2025
[32]

Coda: Coordinating the cerebrum and cerebellum for a dual-brain computer use agent with decoupled reinforcement learning.arXiv preprint arXiv:2508.20096, 2025

Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, et al. Coda: Coordinating the cerebrum and cerebellum for a dual-brain computer use agent with decoupled reinforcement learning.arXiv preprint arXiv:2508.20096, 2025

work page arXiv 2025
[33]

Seagent: Self-evolving computer use agent with autonomous learning from experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

work page arXiv 2025
[34]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

work page arXiv 2025
[38]

OS-ATLAS: foundation action model for generalist GUI agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: foundation action model for generalist GUI agents. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview...

2025
[39]

Agentsynth: Scalable task generation for generalist computer-use agents.arXiv preprint arXiv:2506.14205, 2025

Jingxu Xie, Dylan Xu, Xuandong Zhao, and Dawn Song. Agentsynth: Scalable task generation for generalist computer-use agents.arXiv preprint arXiv:2506.14205, 2025

work page arXiv 2025
[40]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024
[41]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

work page arXiv 2026
[42]

Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025. URLhttps://arxiv.org/abs/2412.09605

work page arXiv 2025
[43]

Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

work page arXiv 2026
[44]

Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

Chenyu Yang, Su Shiqian, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, and Jifeng Dai. Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

work page arXiv 2025
[45]

macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

Pei Yang, Hai Ci, and Mike Zheng Shou. macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

work page arXiv 2025
[46]

Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539, 2025

Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, et al. Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539, 2025

work page arXiv 2025
[47]

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point, 2026. URL https://arxiv.org/abs/2502.08047

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv preprint arXiv:2503.09780, 2025

Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, and Kamalika Chaudhuri. Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv preprint arXiv:2503.09780, 2025

work page arXiv 2025
[49]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/ 2307.13854. 13 Appendix Overview This appendix provides supplementary material for the main pape...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

All tasks in the instruction should be completed to get the pass

Analyze the task instruction and set the criteria for task completion. All tasks in the instruction should be completed to get the pass
[51]

Decide if the agent correctly completed the task objective (pass/fail)
[52]

task_completion_criteria

If fail, provide a SHORT reason (3-4 sentences), concrete and behavior-focused. This should be detailed enough to help the agent improve without seeing the trajectory. Include which sub-task it failed, which component it did not ground correctly, or why the progress got stuck. Return STRICT JSON only, with this exact schema: { "task_completion_criteria": ...
[53]

Focus on sub-tasks the agent cannot do reliably
[54]

Identify concrete operations the agent misuses or fails to execute
[55]

Categories should be notably different from each other
[56]

Group repeated failures into reusable categories
[57]

Figure 6: Teacher–student weakness summarization prompt

Do not include markdown; return JSON only. Figure 6: Teacher–student weakness summarization prompt. You are evaluating screenshots from a single software domain. Goal: select the screenshots that maximize understanding of the domain’s features and UI components. You will receive candidate screenshots in this pattern: Image 0: <image> Image 1: <image> ... ...
[58]

Coverage of distinct major features/workflows
[59]

Diversity of visible UI components/layout states
[60]

selected_indices

Informational richness (settings/panels/dialogs/menus/output views). Avoid near-duplicates and low-information transitional frames. Return ONLY valid JSON with this schema: { "selected_indices": [int, ...], "reasons": [ { "index": int, "reason": "short reason focused on coverage value" } ] } Figure 7: Screenshot ranking prompt. 19 Goal: - Propose new task...
[62]

Student weakness analysis (teacher pass, student fail)
[65]

queries": [ {

Extra file/folder/code context from this config (provide_info) Requirements: - Generate exactly Y instructions. - Each instruction must be concise end-user style English. - Do not include more than two simple and easy sub-tasks. - Every instruction must satisfy the workspace/path contract. - Must target one or more weak abilities from the analysis. - Must...
[66]

Prior instructions already used (avoid overlap/paraphrase)
[67]

Student weakness analysis: (Not used in this run.)
[68]

Workspace / path contract
[69]

Current docker config array to target
[70]

queries": [ {

Extra file/folder/code context from this config (provide_info) Requirements: - Generate exactly Y instructions. - Each instruction must be concise end-user style English. - Do not include more than two simple and easy sub-tasks. - Every instruction must satisfy the workspace/path contract. - Must maximize diversity and novelty versus prior instructions. -...

[1] [1]

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents, 2025. URL https://arxiv.org/abs/2504.00906

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Claude sonnet 4.6 system card, February 2026

Anthropic. Claude sonnet 4.6 system card, February 2026. URL https://anthropic.com/ claude-sonnet-4-6-system-card

2026

[3] [3]

Fara-7b: An efficient agentic model for computer use, 2025

Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer White- head, and Andrew Zhao. Fara-7b: An efficient agentic model for computer use, 2025. URL https://arxiv.org/abs/2511.19663

work page arXiv 2025

[4] [4]

Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264, 2024

work page arXiv 2024

[5] [5]

RISK: A Framework for GUI Agents in E-commerce Risk Management

Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, and Shuai Chen. Risk: A framework for gui agents in e-commerce risk management.arXiv preprint arXiv:2509.21982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

We- boperator: Action-aware tree search for autonomous agents in web environment, 2025

Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, and Md Rizwan Parvez. We- boperator: Action-aware tree search for autonomous agents in web environment, 2025. URL https://arxiv.org/abs/2512.12692. 10

work page arXiv 2025

[7] [7]

TinyAgent: Function calling at the edge

Lutfi Eren Erdogan, Nicholas Lee, Siddharth Jha, Sehoon Kim, Ryan Tabrizi, Suhong Moon, Coleman Richard Charles Hooper, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. TinyAgent: Function calling at the edge. In Delia Irazu Hernandez Farias, Tom Hope, and Manling Li, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Languag...

2024

[8] [8]

doi: 10.18653/v1/2024.emnlp-demo.9

Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-demo.9. URL https://aclanthology.org/2024.emnlp-demo.9/

work page doi:10.18653/v1/2024.emnlp-demo.9 2024

[9] [9]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT

2025

[10] [10]

Efficient agent training for computer use, 2025

Yanheng He, Jiahe Jin, and Pengfei Liu. Efficient agent training for computer use, 2025. URL https://arxiv.org/abs/2505.13909

work page arXiv 2025

[11] [11]

Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, and Xia Song. Scalable data synthesis for computer use agents with step-level filtering.arXiv preprint arXiv:2512.10962, 2025

work page arXiv 2025

[12] [12]

Lora: Low-rank adaptation of large language models.International Conference on Learning Representations, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.International Conference on Learning Representations, 1(2):3, 2022

2022

[13] [13]

Mitigating catastrophic forgetting in large language models with forgetting-aware pruning

Wei Huang, Anda Cheng, and Yinggui Wang. Mitigating catastrophic forgetting in large language models with forgetting-aware pruning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025,...

work page doi:10.18653/v1/2025.emnlp-main 2025

[14] [14]

Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning

Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei. Unlocking the power of function vectors for characterizing and mitigating catastrophic forgetting in continual instruction tuning. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net,...

2025

[15] [15]

Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

1998

[16] [16]

Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

Niklas Lauffer, Xiang Deng, Srivatsa Kundurthy, Brad Kenstler, and Jeff Da. Imitation learning for multi-turn lm agents via on-policy expert corrections.arXiv preprint arXiv:2512.14895, 2025

work page arXiv 2025

[17] [17]

Screenspot-pro: GUI grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: GUI grounding for professional high-resolution computer use. In Cathal Gurrin, Klaus Schoeffmann, Min Zhang, Luca Rossetto, Stevan Rudinac, Duc-Tien Dang-Nguyen, Wen-Huang Cheng, Phoebe Chen, and Jenny Benois-Pineau, editors, Proceedin...

work page doi:10.1145/3746027.3755688 2025

[18] [18]

On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents.Advances in Neural Information Processing Systems, 37:92130–92154, 2024

2024

[19] [19]

Showui: One vision-language- action model for GUI visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language- action model for GUI visual agent. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 19498– 19508. Computer Vision Foundatio...

work page doi:10.1109/cvpr52734.2025.01816 2025

[20] [20]

Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji. Osexpert: Computer-use agents learning professional skills via exploration.arXiv preprint arXiv:2603.07978, 2026

work page arXiv 2026

[21] [21]

Continual gui agents.arXiv preprint arXiv:2601.20732, 2026

Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, and Tao Feng. Continual gui agents.arXiv preprint arXiv:2601.20732, 2026

work page arXiv 2026

[22] [22]

From correction to mastery: Reinforced distillation of large language model agents.arXiv preprint arXiv:2509.14257, 2025

Yuanjie Lyu, Chengyu Wang, Jun Huang, and Tong Xu. From correction to mastery: Reinforced distillation of large language model agents.arXiv preprint arXiv:2509.14257, 2025

work page arXiv 2025

[23] [23]

Pptarena: A benchmark for agentic powerpoint editing, 2025

Michael Ofengenden, Yunze Man, Ziqi Pang, and Yu-Xiong Wang. Pptarena: A benchmark for agentic powerpoint editing, 2025. URLhttps://arxiv.org/abs/2512.03042

work page arXiv 2025

[24] [24]

Introducing gpt -5.4, March 2026

OpenAI. Introducing gpt -5.4, March 2026. URL https://openai.com/index/ introducing-gpt-5-4/

2026

[25] [25]

Introducing gpt-5.4 mini and nano, March 2026

OpenAI. Introducing gpt-5.4 mini and nano, March 2026. URL https://openai.com/ index/introducing-gpt-5-4-mini-and-nano/

2026

[26] [26]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

2023

[28] [28]

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URL https://arxiv.org/abs/ 1011.0686

work page internal anchor Pith review Pith/arXiv arXiv 2011

[29] [29]

Watch and learn: Learning to use computers from online videos.arXiv preprint arXiv:2510.04673, 2025

Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, and Tomas Pfister. Watch and learn: Learning to use computers from online videos.arXiv preprint arXiv:2510.04673, 2025

work page arXiv 2025

[30] [30]

Trial and error: Exploration-based trajectory optimization for LLM agents.arXiv preprint arXiv:2403.02502, 2024

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024. URL https://arxiv.org/ abs/2403.02502

work page arXiv 2024

[31] [31]

Os-genesis: Automating GUI agent trajectory construction via reverse task synthesis

Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. Os-genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, edit...

2025

[32] [32]

Coda: Coordinating the cerebrum and cerebellum for a dual-brain computer use agent with decoupled reinforcement learning.arXiv preprint arXiv:2508.20096, 2025

Zeyi Sun, Yuhang Cao, Jianze Liang, Qiushi Sun, Ziyu Liu, Zhixiong Zhang, Yuhang Zang, Xiaoyi Dong, Kai Chen, Dahua Lin, et al. Coda: Coordinating the cerebrum and cerebellum for a dual-brain computer use agent with decoupled reinforcement learning.arXiv preprint arXiv:2508.20096, 2025

work page arXiv 2025

[33] [33]

Seagent: Self-evolving computer use agent with autonomous learning from experience

Zeyi Sun, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Tong Wu, Dahua Lin, and Jiaqi Wang. Seagent: Self-evolving computer use agent with autonomous learning from experience. arXiv preprint arXiv:2508.04700, 2025

work page arXiv 2025

[34] [34]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Qwen Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026. 12

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

work page arXiv 2025

[38] [38]

OS-ATLAS: foundation action model for generalist GUI agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: foundation action model for generalist GUI agents. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview...

2025

[39] [39]

Agentsynth: Scalable task generation for generalist computer-use agents.arXiv preprint arXiv:2506.14205, 2025

Jingxu Xie, Dylan Xu, Xuandong Zhao, and Dawn Song. Agentsynth: Scalable task generation for generalist computer-use agents.arXiv preprint arXiv:2506.14205, 2025

work page arXiv 2025

[40] [40]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024

[41] [41]

Mobile-agent-v3

Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

work page arXiv 2026

[42] [42]

Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025. URLhttps://arxiv.org/abs/2412.09605

work page arXiv 2025

[43] [43]

Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience.arXiv preprint arXiv:2601.15876, 2026

work page arXiv 2026

[44] [44]

Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

Chenyu Yang, Su Shiqian, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, and Jifeng Dai. Zerogui: Automating online gui learning at zero human cost.arXiv preprint arXiv:2505.23762, 2025

work page arXiv 2025

[45] [45]

macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

Pei Yang, Hai Ci, and Mike Zheng Shou. macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

work page arXiv 2025

[46] [46]

Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539, 2025

Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, et al. Ferret-ui lite: Lessons from building small on-device gui agents.arXiv preprint arXiv:2509.26539, 2025

work page arXiv 2025

[47] [47]

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point, 2026. URL https://arxiv.org/abs/2502.08047

work page internal anchor Pith review Pith/arXiv arXiv 2026

[48] [48]

Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv preprint arXiv:2503.09780, 2025

Arman Zharmagambetov, Chuan Guo, Ivan Evtimov, Maya Pavlova, Ruslan Salakhutdinov, and Kamalika Chaudhuri. Agentdam: Privacy leakage evaluation for autonomous web agents.arXiv preprint arXiv:2503.09780, 2025

work page arXiv 2025

[49] [49]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/ 2307.13854. 13 Appendix Overview This appendix provides supplementary material for the main pape...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

All tasks in the instruction should be completed to get the pass

Analyze the task instruction and set the criteria for task completion. All tasks in the instruction should be completed to get the pass

[51] [51]

Decide if the agent correctly completed the task objective (pass/fail)

[52] [52]

task_completion_criteria

If fail, provide a SHORT reason (3-4 sentences), concrete and behavior-focused. This should be detailed enough to help the agent improve without seeing the trajectory. Include which sub-task it failed, which component it did not ground correctly, or why the progress got stuck. Return STRICT JSON only, with this exact schema: { "task_completion_criteria": ...

[53] [53]

Focus on sub-tasks the agent cannot do reliably

[54] [54]

Identify concrete operations the agent misuses or fails to execute

[55] [55]

Categories should be notably different from each other

[56] [56]

Group repeated failures into reusable categories

[57] [57]

Figure 6: Teacher–student weakness summarization prompt

Do not include markdown; return JSON only. Figure 6: Teacher–student weakness summarization prompt. You are evaluating screenshots from a single software domain. Goal: select the screenshots that maximize understanding of the domain’s features and UI components. You will receive candidate screenshots in this pattern: Image 0: <image> Image 1: <image> ... ...

[58] [58]

Coverage of distinct major features/workflows

[59] [59]

Diversity of visible UI components/layout states

[60] [60]

selected_indices

Informational richness (settings/panels/dialogs/menus/output views). Avoid near-duplicates and low-information transitional frames. Return ONLY valid JSON with this schema: { "selected_indices": [int, ...], "reasons": [ { "index": int, "reason": "short reason focused on coverage value" } ] } Figure 7: Screenshot ranking prompt. 19 Goal: - Propose new task...

[61] [62]

Student weakness analysis (teacher pass, student fail)

[62] [65]

queries": [ {

Extra file/folder/code context from this config (provide_info) Requirements: - Generate exactly Y instructions. - Each instruction must be concise end-user style English. - Do not include more than two simple and easy sub-tasks. - Every instruction must satisfy the workspace/path contract. - Must target one or more weak abilities from the analysis. - Must...

[63] [66]

Prior instructions already used (avoid overlap/paraphrase)

[64] [67]

Student weakness analysis: (Not used in this run.)

[65] [68]

Workspace / path contract

[66] [69]

Current docker config array to target

[67] [70]

queries": [ {

Extra file/folder/code context from this config (provide_info) Requirements: - Generate exactly Y instructions. - Each instruction must be concise end-user style English. - Do not include more than two simple and easy sub-tasks. - Every instruction must satisfy the workspace/path contract. - Must maximize diversity and novelty versus prior instructions. -...