Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

Xiaomin Li; Yuexing Hao

arxiv: 2606.20363 · v1 · pith:TAJ4DZ4Gnew · submitted 2026-06-18 · 💻 cs.AI

Pith reviewed 2026-06-26 17:00 UTC · model grok-4.3

keywords skill library · trajectory mining · GUI agents · segment clustering · policy training · diagnostic study · computer-using agents · GRPO

The pith

Trajectory mining from GUI interactions produces readable skill clusters but does not reliably improve agent policies on new tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether explicit skill libraries for computer-using agents can be automatically generated from interaction trajectories in a way that enhances downstream performance. It implements a pipeline that segments trajectories, clusters the segments into skills, and then trains a skill-aware policy. On the source benchmark, the clusters show high purity matching known workflows, yet when used for training, the resulting policies show only slight gains in one metric and none in others. The authors conclude that current techniques for boundary detection, representation, and reward modeling fall short for cross-domain transfer. This serves as a diagnostic rather than a solution, highlighting specific bottlenecks in the mining approach.

Core claim

What carries the argument

A three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the annotations.

If this is right

Five of eight mined clusters achieve at least 0.95 purity against InteraSkill Workflows labels on the source benchmark.
GRPO training on the mined skills raises skill-step accuracy on IW from 18.5% to 20.5%.
Performance on the BrowseComp+ benchmark remains essentially unchanged after training.
The skill-aware policy underperforms simple frequency-based priors on several source-domain metrics.
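The purity figure in the first point can be illustrated with a small sketch. It assumes the common definition of cluster purity (the share of a cluster's segments that carry the cluster's most frequent reference label); the paper's exact definition is not given here, and the cluster ids and labels below are hypothetical toy data.

```python
from collections import Counter


def cluster_purity(labels):
    """Share of a cluster's segments that carry its most common reference label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical toy data: cluster id -> reference workflow labels of its segments.
clusters = {
    "c0": ["login"] * 19 + ["search"],       # 19 of 20 segments agree
    "c1": ["search"] * 6 + ["export"] * 4,   # 6 of 10 segments agree
}

purities = {cid: cluster_purity(lbls) for cid, lbls in clusters.items()}

# Clusters that clear the 0.95 threshold quoted in the review.
high_purity = [cid for cid, p in purities.items() if p >= 0.95]
```

Under this definition a cluster can be highly pure and still be of little use to a policy, since purity only measures agreement with source-domain labels.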

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A boundary detector that more accurately identifies skill transitions could allow the same clustering step to produce clusters that support larger policy gains.
Replacing the orderless segment representation with one that preserves sequence order might capture dependencies that current clusters miss.
Switching from an offline reward model to one that is updated during policy training could reduce the gap to frequency priors observed on source metrics.
Applying the pipeline to additional held-out domains beyond IW and BrowseComp+ would test whether the reported insufficiency is specific to those benchmarks.
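The point about orderless representations can be made concrete with a toy comparison. This is a sketch under an assumption: the source only says the segment representation is orderless, so a bag of action types stands in for it here, and adjacent-pair counts stand in for an order-preserving alternative. The action names are hypothetical.

```python
from collections import Counter


def bag_of_actions(segment):
    """Orderless stand-in: counts of action types, ignoring position."""
    return Counter(segment)


def action_bigrams(segment):
    """Order-aware stand-in: counts of adjacent action pairs."""
    return Counter(zip(segment, segment[1:]))


# Two hypothetical segments with the same actions in a different order.
seg_a = ["click", "type", "click", "submit"]
seg_b = ["type", "click", "click", "submit"]

# The orderless view cannot tell the segments apart; the bigram view can.
same_bag = bag_of_actions(seg_a) == bag_of_actions(seg_b)
same_bigrams = action_bigrams(seg_a) == action_bigrams(seg_b)
```

Segments that collapse to the same vector end up in the same cluster regardless of which step depended on which, which is the kind of dependency the second point suggests may be lost.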

Load-bearing premise

High purity of mined clusters against InteraSkill Workflows labels indicates transferable skills that will improve policies on held-out benchmarks like BrowseComp+.

What would settle it

A controlled experiment that keeps all other components fixed but replaces the current boundary detector with one that achieves near-perfect segment boundaries, then measures whether GRPO training produces gains on BrowseComp+ larger than the observed zero change.
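One way to quantify how far a boundary detector is from "near-perfect" is a tolerance-based boundary F1. The sketch below is an assumption about how such a check might be scored, not the paper's metric: the greedy one-to-one matching, the `tolerance` parameter, and the step indices are all illustrative.

```python
def boundary_f1(predicted, reference, tolerance=1):
    """Score predicted boundaries against reference ones.

    Each reference boundary is matched greedily to at most one unused
    predicted boundary that lies within `tolerance` steps of it.
    """
    unused = sorted(predicted)
    hits = 0
    for ref in sorted(reference):
        match = next((p for p in unused if abs(p - ref) <= tolerance), None)
        if match is not None:
            unused.remove(match)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical step indices where one skill ends and the next begins.
reference = [5, 12, 20]
predicted = [4, 12, 17, 25]

precision, recall, f1 = boundary_f1(predicted, reference)
```

A detector swapped in for the controlled experiment would be expected to push this score close to 1.0 before the downstream GRPO comparison is rerun.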

Figures

Figures reproduced from arXiv: 2606.20363 by Xiaomin Li, Yuexing Hao.

**Figure 1.** Study design for automated SKILL.md generation. IW is the source dataset for trajectory segmentation, skill-library construction, and Phase 3 GRPO policy training; WebArena and BrowseComp+ are the completed held-out transfer checks. Mind2Web zero-shot and WorkArena-NLP are reported only as diagnostics, not as current GRPO transfer evidence. The paper evaluates boundary quality, cluster quality, auto-gener…

**Figure 2.** Data-efficiency comparison for generated …

Original abstract

Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a clear negative result on trajectory-mined skills for agents: high source purity but no real policy gains and underperformance vs. frequency baselines.

The main takeaway is that this is a diagnostic study showing trajectory mining can produce readable clusters (five of eight with at least 0.95 purity against InteraSkill Workflows labels), yet the resulting skills do not transfer well. GRPO lifts IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially flat, and loses to simple frequency priors on key source-domain metrics. That negative outcome is the actual contribution.
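To show how low the bar set by a frequency prior is, here is a minimal sketch of one such baseline. It assumes the simplest reading (always predict the most frequent skill label seen in training); the paper's actual priors are not described here, and the labels are hypothetical.

```python
from collections import Counter


def fit_frequency_prior(train_labels):
    """Return the single most frequent skill label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def skill_step_accuracy(predictions, gold):
    """Fraction of steps whose predicted skill label matches the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical per-step skill labels.
train = ["search", "search", "fill_form", "search", "export"]
test_gold = ["search", "fill_form", "search", "export"]

majority = fit_frequency_prior(train)
baseline_acc = skill_step_accuracy([majority] * len(test_gold), test_gold)
```

A baseline like this uses no trajectory content at all, which is why a trained policy falling below it is a strong negative signal.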

What the work does is apply a three-stage pipeline (segmentation, clustering of orderlessly represented segments, then GRPO training against an offline reward model) to GUI trajectories and report the transfer failure explicitly. The honesty in framing it as evidence that current boundary detection and segment representations are insufficient is useful. It directly tests and falsifies the assumption that high-purity clusters will yield cross-domain gains.

The soft spots are mostly in the limited evidence for the pipeline itself. The abstract gives the purity and accuracy numbers but no error bars, variance across runs, or access to the mined clusters and reward model details. Without those, it is hard to judge whether the boundary detector or the orderless representation is the main culprit or if small changes could flip the result. The benchmarks are standard, but the underperformance versus priors makes the insufficiency claim rest heavily on those specific numbers.

This is for people working on skill libraries and agent trajectory mining in the computer-use setting. A reader already following InteraSkill or similar GUI agent work will get value from the negative transfer data. It deserves a serious referee because the empirical comparison is direct and the conclusion is internally consistent, even if the absolute gains are small.

Referee Report

2 major / 2 minor

Summary. The paper describes a three-stage pipeline (trajectory segmentation, segment clustering into candidate skills, and GRPO-based skill-aware policy training) for automatically generating inspectable skill libraries from GUI interaction data. It reports that five of eight mined clusters achieve ≥0.95 purity against InteraSkill Workflows labels on the source domain, yet GRPO yields only a 2-point gain in IW skill-step accuracy (18.5% → 20.5%), no improvement on BrowseComp+, and underperforms frequency priors; the work is framed as a diagnostic study showing that current boundary detection, orderless segment representations, and offline rewards are insufficient for reliable cross-domain policy gains.
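For readers unfamiliar with GRPO, the core of the method is a group-relative advantage: several responses are sampled for the same prompt and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization step. The use of the population standard deviation, the `eps` constant, and the reward values are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    `eps` guards against division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards from an offline reward model for four sampled responses.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every response in a group gets the same reward, no response is preferred.
flat_group = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

The second case is relevant to the offline-reward concern: if the reward model rarely separates responses within a group, the advantages are near zero and the policy receives little learning signal.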

Significance. If the negative result is robust, the paper supplies a concrete falsification that high source-domain cluster purity does not imply transferable policy improvement, identifying three specific pipeline bottlenecks. This diagnostic framing is useful for the computer-using agents literature and avoids overclaiming positive transfer. The explicit comparison against both external labels and trivial baselines strengthens the evidentiary value.

major comments (2)

[§4] §4 (Results): the central insufficiency claim rests on the GRPO vs. frequency-prior comparison and the 2-point IW gain (18.5% → 20.5%), yet the text provides only point estimates with no error bars, run-to-run variance, or statistical significance tests; without these, it is impossible to judge whether the observed gaps are reliable enough to support the conclusion that the three pipeline components are insufficient.
[§3.2–3.3] §3.2–3.3 (Boundary detector and segment representation): the paper identifies these as load-bearing limitations but reports no ablation that isolates their individual contributions to the transfer failure (e.g., replacing the orderless representation with an ordered one while keeping the same clusters); the diagnostic conclusion therefore remains partly qualitative.

minor comments (2)

[Abstract, §4] Abstract and §4: the purity and accuracy numbers are given without reference to the exact number of trajectories or episodes used, making it hard to assess sample size.
[Figures] Figure captions: several figures lack axis labels or legend entries that would allow a reader to verify the reported purity and accuracy values directly from the plots.
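The sample-size comment matters because the width of a confidence interval on an accuracy depends directly on the number of evaluated steps. The sketch below computes a 95% Wilson score interval for the two reported accuracies under a purely hypothetical evaluation size of `n = 200`; the true number of steps is not stated in the source.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


n = 200  # hypothetical number of evaluated skill steps, not from the paper

baseline_ci = wilson_interval(37, n)  # 37/200 = 18.5%
grpo_ci = wilson_interval(41, n)      # 41/200 = 20.5%

# True when the two intervals overlap.
intervals_overlap = baseline_ci[1] > grpo_ci[0]
```

With this hypothetical `n` the two intervals overlap substantially. Overlap is not a formal test, and a paired analysis would be more appropriate if both policies are scored on the same steps, but it indicates why the reported counts are needed.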

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and the recommendation for minor revision. The comments identify key areas where additional rigor can strengthen the diagnostic claims regarding the pipeline's limitations. Below we provide point-by-point responses.

Referee: [§4] §4 (Results): the central insufficiency claim rests on the GRPO vs. frequency-prior comparison and the 2-point IW gain (18.5% → 20.5%), yet the text provides only point estimates with no error bars, run-to-run variance, or statistical significance tests; without these, it is impossible to judge whether the observed gaps are reliable enough to support the conclusion that the three pipeline components are insufficient.

Authors: We agree with this assessment. The current manuscript reports only single-run point estimates for the skill-step accuracy improvements. In the revised version, we will rerun the GRPO training multiple times to compute means and standard deviations, and include statistical tests (such as Wilcoxon signed-rank tests) comparing against the frequency prior baseline. This will allow readers to better evaluate the reliability of the 2-point gain and the underperformance relative to priors. revision: yes
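The analysis promised in this response could look roughly like the sketch below: per-seed means and standard deviations, plus an exact two-sided Wilcoxon signed-rank test on paired differences. The per-seed accuracies are invented for illustration, and the pure-Python test is a small-sample stand-in (it enumerates all sign patterns, so it is only practical for a handful of runs). In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from itertools import product
from statistics import mean, stdev


def wilcoxon_signed_rank_exact(diffs):
    """Exact two-sided Wilcoxon signed-rank test for a small paired sample.

    Zero differences are dropped and tied magnitudes share an average rank.
    Returns (W+, p-value), where W+ is the rank sum of positive differences.
    """
    d = [x for x in diffs if x != 0]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    half_total = sum(ranks) / 2
    observed = abs(w_plus - half_total)
    # Under the null, every sign pattern is equally likely.
    extreme = 0
    for signs in product([0, 1], repeat=len(d)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - half_total) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(d)


# Hypothetical skill-step accuracies (%) over five seeds; not from the paper.
grpo = [20.5, 19.0, 21.0, 18.0, 20.0]
prior = [22.0, 21.5, 21.5, 22.5, 23.0]

grpo_mean, grpo_sd = mean(grpo), stdev(grpo)
prior_mean, prior_sd = mean(prior), stdev(prior)

diffs = [g - p for g, p in zip(grpo, prior)]
w_plus, p_value = wilcoxon_signed_rank_exact(diffs)
```

With only five paired runs the smallest achievable two-sided p-value is 2/32, so a test of this kind needs more seeds than that to reach the conventional 0.05 level.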
Referee: [§3.2–3.3] §3.2–3.3 (Boundary detector and segment representation): the paper identifies these as load-bearing limitations but reports no ablation that isolates their individual contributions to the transfer failure (e.g., replacing the orderless representation with an ordered one while keeping the same clusters); the diagnostic conclusion therefore remains partly qualitative.

Authors: We acknowledge that the identification of specific bottlenecks is based on the overall experimental outcomes rather than isolated ablations. Conducting the suggested ablations would require substantial additional engineering and compute to implement ordered segment representations and alternative boundary detectors while controlling for other variables. As this is framed as a diagnostic study highlighting insufficiencies, we believe the current evidence suffices to motivate future work on these components. We will revise the text to more clearly state the qualitative basis of these claims and their implications. revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper is framed explicitly as a diagnostic study: it mines clusters from trajectories, reports high source-domain purity (at least 0.95 on five of eight clusters against InteraSkill Workflows labels), then shows that the resulting annotations yield only marginal GRPO gains (a 2-point rise in IW skill-step accuracy) and no BrowseComp+ improvement while underperforming frequency priors. This negative result is supported by direct empirical comparisons to external baselines and does not rely on any derivation that reduces a claimed prediction or uniqueness result to fitted parameters, self-citations, or definitional equivalence. No load-bearing step invokes the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work is presented as an empirical diagnostic without explicit modeling assumptions or new postulated entities.


[1] [1]

WebShop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

2022

[2] [2]

Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

2023

[3] [3]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations, 2024. 9

2024

[4] [4]

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

2024

[5] [5]

Laradji, Manuel Del Verme, Tom Marty, Leo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Leo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? InInternational Conference on Machine Learning, 2024

2024

[6] [6]

OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. InAdvances in Neural Information P...

2024

[7] [7]

OpAgent: Operator agent for web navigation

Yuyu Guo, Wenjie Yang, Siyuan Yang, et al. OpAgent: Operator agent for web navigation. arXiv preprint arXiv:2602.13559, 2026

Pith/arXiv arXiv 2026

[8] [8]

OpenCUA: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

Xinyuan Wang, Bowen Wang, Dunjie Lu, et al. OpenCUA: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

arXiv 2025

[9] [9]

UltraCUA: A foundation model for computer use agents with hybrid action.arXiv preprint arXiv:2510.17790, 2025

Yuhao Yang, Zhen Yang, Zi-Yi Dou, et al. UltraCUA: A foundation model for computer use agents with hybrid action.arXiv preprint arXiv:2510.17790, 2025

Pith/arXiv arXiv 2025

[10] [10]

InProceedings of the CHI Conference on Human Factors in Computing Systems, CHI 2024, Honolulu, HI, USA, May 11-16, 2024, Florian ’Floyd’ Mueller, Penny Kyburz, Julie R

Yuexing Hao, Zeyu Liu, Bob Riter, and Saleh Kalantari. Advancing patient-centered shared decision-making with AI systems for older adult cancer patients. InProceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–19, 2024. doi: 10.1145/3613904. 3642353

work page doi:10.1145/3613904 2024

[11] Yuexing Hao, Jason Holmes, Mark R. Waddle, Brian J. Davis, Nathan Y. Yu, Kristin Vickers, Heather Preston, Drew Margolin, Corinna E. Lockenhoff, Aditya Vashistha, Saleh Kalantari, Marzyeh Ghassemi, and Wei Liu. Personalizing prostate cancer education for patients using an EHR-integrated LLM agent. npj Digital Medicine, 2025

[12] Yuexing Hao, Kumail Alhamoud, Hyewon Jeong, Haoran Zhang, Isha Puri, Philip Torr, Mike Schaekermann, Ariel D. Stern, and Marzyeh Ghassemi. MedPAIR: Measuring physicians and AI relevance alignment in medical question answering. arXiv preprint arXiv:2505.24040, 2025

[13] Xiaomin Li, Mingye Gao, Yuexing Hao, Taoran Li, Guangya Wan, Zihan Wang, and Yijun Wang. MedGUIDE: Benchmarking clinical decision-making in large language models. arXiv preprint arXiv:2505.11613, 2025

[14] Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, and Hong Hu. Selection of LLM fine-tuning data based on orthogonal rules. arXiv preprint arXiv:2410.04715, 2024

[15] Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, and Weiyu Li. Data-adaptive safety rules for training reward models. arXiv preprint arXiv:2501.15453, 2025

[16] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024

[17] Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. SkillWeaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025

[18] Minghao Chen, Yihang Li, Yanting Yang, Shiyu Yu, Binbin Lin, and Xiaofei He. AutoManual: Constructing instruction manuals by LLM agents via interactive environmental learning. In Advances in Neural Information Processing Systems, 2024

[19] Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, and Katerina Fragkiadaki. VLM agents generate their own memories: Distilling experience into embodied programs of thought. In Advances in Neural Information Processing Systems, 2024

[20] Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiyong Chen, Yuning Chai, Shuai Ren, Hao Wang, Shixiang He, and Wanli Meng. LearnAct: Few-shot mobile GUI agent with a unified demonstration benchmark. arXiv preprint arXiv:2504.13805, 2025

[21] Jingwen Deng, Zihao Wang, Shaofei Cai, Anji Liu, and Yitao Liang. Open-world skill discovery from unsegmented demonstration videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10708–10718, 2025

[22] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999

[23] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017

[24] Tejas D. Kulkarni, Karthik R. Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016

[25] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 3540–3549. PMLR, 2017

[26] Ofir Nachum, Shixiang Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, 2018

[27] Matthew Riemer, Miao Liu, and Gerald Tesauro. Learning abstract options. In Advances in Neural Information Processing Systems, 2018

[28] Siyuan Li, Rui Wang, Minxue Tang, and Chongjie Zhang. Hierarchical reinforcement learning with advantage-based auxiliary rewards. In Advances in Neural Information Processing Systems, 2019

[29] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. In International Conference on Learning Representations, 2018

[30] Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, and Ofir Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations, 2021

[31] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. In International Conference on Learning Representations, 2017

[32] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019

[33] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020

[34] Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Unsupervised reinforcement learning with contrastive intrinsic control. In Advances in Neural Information Processing Systems, 2022

[35] Dibya Ghosh, Abhishek Gupta, and Sergey Levine. Learning actionable representations with goal-conditioned policies. In International Conference on Learning Representations, 2019

[36] Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. The information geometry of unsupervised reinforcement learning. In International Conference on Learning Representations, 2022

[37] Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. In Advances in Neural Information Processing Systems, 2024

[38] Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenhan Zhao, Yuxiao Yang, Xiao Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. In International Conference on Learning Representations, 2025

[39] Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials. In International Conference on Learning Representations, 2025

[40] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, and Zhiyong Wu. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Lingui... 2025

[41] Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, and Erran Li. Proposer-agent-evaluator (PAE): Autonomous skill discovery for foundation model internet agents. In Proceedings of the 42nd International Conference on Machine Learning, 2025

[42] Yu Tian, Jiawei Chen, Lifan Zheng, Mingxiang Tao, Xinyi Zeng, Zhaoxia Yin, Hang Su, and Xian Sun. Skills-coach: A self-evolving skill optimizer via training-free GRPO. arXiv preprint arXiv:2604.27488, 2026

[43] Charles Truong, Laurent Oudre, and Nicolas Vayatis. Selective review of offline change point detection methods. Signal Processing, 167:107299, 2020. doi: 10.1016/j.sigpro.2019.107299

[44] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982

[45] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5–6):355–607, 2019

[46] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013

[47] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, 2020

A Appendix

A.1 Additional GRPO Training Sessions

We also run a scale-control GRPO session on Llama-3.1-70B-Instruct with quantized ...