pith. machine review for the scientific record.

arxiv: 2605.13527 · v2 · submitted 2026-05-13 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

MMSkills: Towards Multimodal Skills for General Visual Agents


Pith reviewed 2026-05-15 05:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords multimodal skills · visual agents · procedural knowledge · GUI agents · game agents · skill generation · multimodal procedures · agent trajectories

The pith

MMSkills equips visual agents with reusable packages of textual procedures, state cards, and multi-view keyframes derived from public trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the need for multimodal procedural knowledge in visual agents, where reuse requires not only knowing the operation but also recognizing states and interpreting visual evidence of progress or failure. It presents MMSkills as a framework that packages each skill compactly with text plus runtime state cards and keyframes, generated through an agentic process of trajectory grouping, procedure induction, visual grounding, and auditing. At inference, a branch-loaded mechanism lets the agent inspect selected evidence in a temporary branch, align it with the live scene, and distill guidance without overloading the main context. Experiments on GUI and game benchmarks show consistent gains for both frontier and smaller multimodal agents, supporting the view that external multimodal knowledge complements internal model priors.

Core claim

MMSkills represents reusable multimodal procedures as state-conditioned packages that couple a textual procedure with runtime state cards and multi-view keyframes, generated from public non-evaluation trajectories via workflow grouping, procedure induction, visual grounding, and meta-skill auditing, and consulted at runtime through temporary branch loading that aligns evidence with the live environment.
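
To make the generator's four stages easier to follow, here is a minimal, hypothetical sketch of how trajectories could flow through grouping, induction, grounding, and auditing. Every name and data shape below is an editorial assumption for illustration, not the paper's implementation.

```python
# Hypothetical sketch of the trajectory-to-skill pipeline named in the core
# claim. Stage order follows the paper; every signature here is an assumption.
from typing import Callable

Trajectory = dict   # e.g. {"task": str, "steps": [{"action": ..., "screenshot": ...}]}
MMSkill = dict      # e.g. {"procedure": str, "state_cards": [...], "keyframes": [...]}

def generate_skills(
    trajectories: list[Trajectory],
    group: Callable[[list[Trajectory]], list[list[Trajectory]]],
    induce: Callable[[list[Trajectory]], str],
    ground: Callable[[list[Trajectory], str], tuple[list, list]],
    audit: Callable[[MMSkill], bool],
) -> list[MMSkill]:
    """Group raw trajectories into workflows, induce a textual procedure per
    workflow, ground it in state cards and keyframes, and keep only skills
    that pass the meta-skill audit."""
    skills = []
    for workflow in group(trajectories):                  # 1. workflow grouping
        procedure = induce(workflow)                      # 2. procedure induction
        cards, keyframes = ground(workflow, procedure)    # 3. visual grounding
        skill = {"procedure": procedure,
                 "state_cards": cards,
                 "keyframes": keyframes}
        if audit(skill):                                  # 4. meta-skill auditing
            skills.append(skill)
    return skills
```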

What carries the argument

The MMSkill package: a compact, state-conditioned unit that pairs a textual procedure with runtime state cards and multi-view keyframes to support state recognition, visual progress interpretation, and next-action decisions.
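
As an editorial illustration only, one plausible shape for such a package is sketched below; the field names are assumptions, not the paper's schema.

```python
# Illustrative data shape for an MMSkill package; field names are assumptions,
# not taken from the paper. File paths stand in for stored keyframe images.
from dataclasses import dataclass, field

@dataclass
class StateCard:
    description: str                                  # textual summary of a recognizable runtime state
    cues: list[str] = field(default_factory=list)     # visual cues that identify the state

@dataclass
class MMSkillPackage:
    name: str
    procedure: str                                    # textual step-by-step procedure
    state_cards: list[StateCard]                      # states the procedure passes through
    keyframes: list[str]                              # multi-view keyframe image paths
    source_trajectories: list[str] = field(default_factory=list)  # provenance
```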

If this is right

  • Public interaction trajectories can be transformed into reusable multimodal skills that agents consult across tasks.
  • Both frontier and smaller multimodal agents receive consistent performance lifts on GUI and game-based benchmarks.
  • External multimodal procedural knowledge complements the priors already present inside the model.
  • Temporary branch loading allows agents to inspect and align multimodal evidence without permanent context overload.
  • The generation pipeline of grouping, induction, grounding, and auditing turns raw trajectories into auditable skill packages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Community-wide sharing of such skill packages could let agents bootstrap capabilities faster than training alone.
  • Pairing MMSkills with future larger models might produce additive gains by supplying structured external memory.
  • Extending the same generation and branch-loading approach to robotic or web-scale environments would test whether the multimodal format generalizes beyond current benchmarks.

Load-bearing premise

That the generated multimodal skills can be consulted at inference time without excessive image context or over-anchoring to reference screenshots.
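
Read operationally, the premise is that consultation runs in a throwaway branch with a hard image budget, returning only distilled text to the main context. The sketch below assumes embedding-based keyframe selection and a cap of four images, following the simulated rebuttal; it is not the paper's implementation.

```python
# Hedged sketch of a branch-loaded consultation with a hard image budget.
# `embed`, `run_branch`, and the cap of 4 images are assumptions for illustration.
MAX_IMAGES = 4  # assumed cap on added image context per consultation

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def consult_skill(skill, live_screenshot, embed, run_branch):
    """Inspect a few keyframes in a temporary branch; only distilled textual
    guidance returns to the main agent's context."""
    live_vec = embed(live_screenshot)
    # Rank reference keyframes by visual similarity to the live scene.
    ranked = sorted(skill["keyframes"],
                    key=lambda kf: cosine(embed(kf), live_vec),
                    reverse=True)
    selected = ranked[:MAX_IMAGES]

    # The branch sees state cards, selected keyframes, and the live scene;
    # everything except the distilled guidance is discarded afterwards.
    guidance = run_branch(
        prompt="Align these reference views with the live screen, state the "
               "progress made so far, and suggest the next action.",
        images=[live_screenshot, *selected],
        notes=[card["description"] for card in skill["state_cards"]],
    )
    return guidance  # plain text only; no images carried back
```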

What would settle it

A controlled run on the same GUI and game benchmarks in which adding MMSkills produces no gain or a drop in agent success rate relative to the no-skill baseline.

read the original abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as multimodal procedural knowledge and address three practical challenges: (I) what a multimodal skill package should contain; (II) where such packages can be derived from public interaction experience; and (III) how agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce MMSkills, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MMSkills, a framework for reusable multimodal procedural knowledge in visual agents. Each MMSkill packages a textual procedure with state cards and multi-view keyframes. These are generated from public trajectories via an agentic trajectory-to-skill Generator (workflow grouping, procedure induction, visual grounding, meta-skill auditing) and consulted at inference via a branch-loaded agent that inspects selected cards and keyframes in a temporary branch, aligns them with the live environment, and distills guidance for the main agent. Experiments on GUI and game-based benchmarks are claimed to show consistent improvements for both frontier and smaller multimodal agents, suggesting external multimodal knowledge complements model-internal priors.

Significance. If the empirical claims hold after proper quantification and controls, the work would provide a concrete mechanism for injecting reusable visual procedural knowledge into agents without retraining, addressing a gap between textual skill libraries and visual decision-making. The branch-loaded consultation mechanism and trajectory-to-skill pipeline are novel engineering contributions that could be adopted by other visual-agent systems.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the central claim that MMSkills 'consistently improve both frontier and smaller multimodal agents' is stated without any quantitative results, baselines, error bars, number of runs, or statistical tests. This absence prevents verification of the complementarity hypothesis and makes the experimental support for the framework load-bearing claim unverifiable.
  2. [§3.3] §3.3 (Branch-loaded multimodal skill agent): the inference procedure is described as inspecting state cards and keyframes in a temporary branch, aligning them with the live environment, and distilling guidance, yet no quantitative bound on added image tokens, no specification of the alignment method (feature matching, prompt injection, etc.), and no ablation on over-anchoring or context exhaustion are provided. These omissions directly affect whether the claimed 'without excessive image context' property holds.
  3. [§3.2] §3.2 (trajectory-to-skill Generator): the pipeline (workflow grouping, procedure induction, visual grounding, meta-skill-guided auditing) is presented at a high level with no pseudocode, no formal definition of the output MMSkill structure, and no analysis of failure modes or coverage of the generated skills relative to the source trajectories.
minor comments (2)
  1. [Figure 1 and §2] Figure 1 and §2: the distinction between 'state cards' and 'keyframes' is introduced without a precise definition of their visual format or how they differ from standard screenshots.
  2. [Related Work] Related Work: several recent works on visual skill libraries and GUI agents are cited but the positioning relative to them could be sharpened by explicitly stating which components are new versus adapted.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas where additional quantification, formalization, and ablation analysis will strengthen the manuscript. We address each major comment below and commit to revisions that directly resolve the identified gaps.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim that MMSkills 'consistently improve both frontier and smaller multimodal agents' is stated without any quantitative results, baselines, error bars, number of runs, or statistical tests. This absence prevents verification of the complementarity hypothesis and makes the experimental support for the framework load-bearing claim unverifiable.

    Authors: We agree that the abstract summarizes the empirical outcomes at too high a level and that §4 would benefit from explicit quantification to allow verification. In the revised manuscript we will update the abstract to reference the concrete success-rate improvements, baselines (direct prompting, retrieval-augmented agents, and skill-free variants), and the number of runs. In §4 we will add error bars (standard deviation across runs), state the exact number of runs performed, and report statistical significance tests supporting the complementarity claim. revision: yes

  2. Referee: [§3.3] §3.3 (Branch-loaded multimodal skill agent): the inference procedure is described as inspecting state cards and keyframes in a temporary branch, aligning them with the live environment, and distilling guidance, yet no quantitative bound on added image tokens, no specification of the alignment method (feature matching, prompt injection, etc.), and no ablation on over-anchoring or context exhaustion are provided. These omissions directly affect whether the claimed 'without excessive image context' property holds.

    Authors: We acknowledge that the current description of the branch-loaded consultation lacks the requested quantitative and methodological details. In the revision we will specify the alignment procedure (a hybrid of prompt injection for textual state cards and embedding-based visual similarity for keyframe selection), provide an explicit bound on added image tokens (at most four additional images per consultation), and include a dedicated ablation on over-anchoring and context-length limits to substantiate the “without excessive image context” claim. revision: yes

  3. Referee: [§3.2] §3.2 (trajectory-to-skill Generator): the pipeline (workflow grouping, procedure induction, visual grounding, meta-skill-guided auditing) is presented at a high level with no pseudocode, no formal definition of the output MMSkill structure, and no analysis of failure modes or coverage of the generated skills relative to the source trajectories.

    Authors: We agree that the generator pipeline is described at an insufficient level of formality. In the revised version we will add (i) pseudocode for the full trajectory-to-skill workflow in the appendix, (ii) a formal definition of the MMSkill data structure (textual procedure + state cards + multi-view keyframes), and (iii) an analysis of failure modes together with coverage statistics showing the fraction of source trajectories that successfully produce valid skills. revision: yes

Circularity Check

0 steps flagged

No circularity: framework is procedural construction from trajectories with independent empirical validation

full rationale

The paper contains no equations, derivations, fitted parameters, or predictions that reduce to inputs by construction. MMSkills is introduced as a representation and pipeline (trajectory-to-skill generator plus branch-loaded agent) that transforms public interaction data into reusable multimodal packages; the central claim of complementarity is supported by benchmark experiments rather than self-referential definitions or self-citation chains. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a load-bearing manner. The approach adds external knowledge whose validity is tested externally, so no load-bearing claim reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the assumption that multimodal models can effectively process the provided visual cards and keyframes, plus the new entities introduced for skill representation and usage.

axioms (1)
  • domain assumption: Multimodal models can process and align visual state cards with live environments at inference time.
    Invoked in the description of how the branch-loaded agent uses the skills.
invented entities (3)
  • MMSkill · no independent evidence
    purpose: Compact state-conditioned package coupling a textual procedure with runtime state cards and multi-view keyframes.
    Core new representation introduced to address multimodal procedural knowledge.
  • trajectory-to-skill Generator · no independent evidence
    purpose: Agentic process that transforms public trajectories into reusable multimodal skills via grouping, induction, grounding, and auditing.
    Proposed mechanism for deriving the skills from experience.
  • branch-loaded multimodal skill agent · no independent evidence
    purpose: Mechanism that inspects selected state cards and keyframes in a temporary branch to produce structured guidance.
    New inference-time usage pattern to avoid context overload.

pith-pipeline@v0.9.0 · 5608 in / 1426 out tokens · 42979 ms · 2026-05-15T05:55:26.129211+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    Skills for agents.Skill reuse has a long history in temporal abstraction for reinforcement learning and motor primitives for robotics (Sutton et al., 1999; Ijspeert et al., 2013). Recent LLM agents have made skills a practical interface for storing and composing procedural knowledge in language-conditioned environments. Early systems connected language mo...