
arxiv: 2606.18448 · v1 · pith:VUH5ITOA · submitted 2026-06-16 · 💻 cs.CL

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Pith reviewed 2026-06-27 00:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords visual skills · multimodal skills · computer-use agents · GUI interaction · skill libraries · UI exploration · hierarchical skills · long-horizon tasks

The pith

Retaining visual figures in skills for computer-use agents lifts average benchmark scores from 0.373 to 0.456 over text-only versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that text-only skill libraries limit computer-use agents on long-horizon tasks because GUI interactions are inherently visual. VISUALSKILL stores figures alongside text in a hierarchical structure per application and delivers them on demand through a load_topic tool. A Claude-based agent using these skills scores 0.456 on CUA-World and OSExpert-Eval, an 8.3-point gain over matched text-only skills built from identical source material. The gain arises because figures let the agent locate UI elements and confirm state after actions. A sympathetic reader would see this as evidence that preserving visual content in reusable artifacts narrows the gap to reliable performance on unseen software.

Core claim

VISUALSKILL is a hierarchical multimodal skill tailored to each target application and organised as a central index over per-topic files. The agent consumes it through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. Each skill is constructed with a two-stage pipeline that combines authored documentation with live-application UI exploration. On the two benchmarks this produces an average score of 0.456, an absolute gain of 8.3 points over a matched text-only skill that differs only in modality.
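
To make the on-demand access pattern concrete, here is a minimal sketch of a `load_topic`-style lookup. It is not the authors' implementation: the folder layout (one directory per topic holding a `guide.md` and image files next to it), the `TopicResult` type, and the path-escape guard are all assumptions made for illustration. Only the names `SKILL.md`, `guide.md`, `load_topic`, `fig01.png`, and the topic id `formatting-text/character-formatting` come from the paper's figure captions.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


@dataclass
class TopicResult:
    """Hypothetical container for what one topic lookup returns."""
    topic: str
    text: str
    figures: list = field(default_factory=list)  # paths to figure files


def load_topic(skill_root: Path, topic: str) -> TopicResult:
    """Return the guide text and figure paths for one topic (assumed layout)."""
    root = skill_root.resolve()
    topic_dir = (root / topic).resolve()
    # Refuse topic ids that escape the skill folder, e.g. "../outside".
    if not topic_dir.is_relative_to(root):
        raise ValueError(f"topic outside skill root: {topic}")
    guide = topic_dir / "guide.md"
    if not guide.is_file():
        raise FileNotFoundError(f"no guide.md for topic: {topic}")
    figures = sorted(
        p for p in topic_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return TopicResult(topic, guide.read_text(encoding="utf-8"), figures)


def build_demo_skill(root: Path) -> None:
    """Write a tiny placeholder skill; the contents are illustrative only."""
    topic_dir = root / "formatting-text" / "character-formatting"
    topic_dir.mkdir(parents=True)
    (root / "SKILL.md").write_text(
        "- character-formatting: when to use ...\n", encoding="utf-8"
    )
    (topic_dir / "guide.md").write_text(
        "Select the text, then ...\n", encoding="utf-8"
    )
    (topic_dir / "fig01.png").write_bytes(b"placeholder, not a real image")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_skill(Path(tmp))
        result = load_topic(Path(tmp), "formatting-text/character-formatting")
        print(result.text.strip(), [p.name for p in result.figures])
```

The point of the sketch is the split the paper describes: the agent reads only the index up front, and a topic's text and figures are read from disk only when that topic is requested.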

What carries the argument

VISUALSKILL, the hierarchical multimodal skill that retains visual figures from documentation and live UI exploration, accessed on demand via the load_topic tool.

If this is right

  • Agents locate UI elements more reliably when figures remain in the skill artifact.
  • Workflow state verification after each action becomes more accurate.
  • The 15.3-point lift over the no-skill baseline holds on both CUA-World and OSExpert-Eval.
  • Skills become application-specific through the combination of authored docs and live exploration.
  • The modality difference alone accounts for the observed performance gap.
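
The point gains quoted on this page follow directly from the three averages reported in the abstract (0.303 no skill, 0.373 text-only, 0.456 VISUALSKILL). The short check below recomputes them; the function name `point_gain` is ours, and "points" is read as 100 times the score difference on the 0 to 1 scale.

```python
NO_SKILL = 0.303      # no-skill baseline (from the abstract)
TEXT_ONLY = 0.373     # matched text-only skill (from the abstract)
VISUALSKILL = 0.456   # multimodal skill (from the abstract)


def point_gain(new: float, old: float) -> float:
    """Absolute gain in points: 100 x score difference, to one decimal."""
    return round((new - old) * 100, 1)


total_lift = point_gain(VISUALSKILL, NO_SKILL)      # lift over no skill
modality_gain = point_gain(VISUALSKILL, TEXT_ONLY)  # figures vs. text only
text_gain = point_gain(TEXT_ONLY, NO_SKILL)         # text-only skill alone

# Share of the total lift that the text-only skill does not capture.
modality_share = modality_gain / total_lift

if __name__ == "__main__":
    print(total_lift, modality_gain, text_gain, round(modality_share, 2))
```

Read this way, the text-only skill accounts for 7.0 of the 15.3 points and the retained figures for the remaining 8.3, so a little over half of the total lift is attributed to modality.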

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same on-demand figure mechanism could be tested on web-browsing or mobile agents where visual state changes rapidly.
  • Automatic refresh of figures when an application updates would be a natural next engineering step.
  • Lower token consumption might result if agents query figures only when needed rather than receiving long verbal descriptions.
  • The construction pipeline could be applied to non-GUI domains that still benefit from visual references, such as diagram-heavy technical manuals.

Load-bearing premise

The two-stage pipeline combining authored documentation with live-application UI exploration produces accurate, relevant figures without introducing noise or outdated visuals that could mislead the agent.

What would settle it

Replace the figures inside VISUALSKILL files with incorrect or outdated images and measure whether the 8.3-point advantage over the matched text-only skill disappears.
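
One way to build the corrupted condition is to keep every guide's text intact and hand each topic another topic's figures. The sketch below does this by rotating the figure lists by one position; it is an illustrative control of our own design, not an experiment from the paper. All topic ids other than `formatting-text/character-formatting` and all file paths are hypothetical, and re-running the benchmark on the corrupted skill is left out.

```python
def mismatch_figures(figures_by_topic: dict) -> dict:
    """Give every topic the figures of the next topic (sorted order, wrapping).

    Text is untouched, so any score change is attributable to the figures.
    Note: topics with identical figure lists would stay matched; callers
    should check for that before using the result as a control.
    """
    topics = sorted(figures_by_topic)
    if len(topics) < 2:
        raise ValueError("need at least two topics to mismatch figures")
    rotated = {}
    for i, topic in enumerate(topics):
        donor = topics[(i + 1) % len(topics)]
        rotated[topic] = list(figures_by_topic[donor])
    return rotated


# Hypothetical skill inventory used only to exercise the function.
demo = {
    "formatting-text/character-formatting": ["character-formatting/fig01.png"],
    "styles/paragraph-styles": ["paragraph-styles/fig01.png",
                                "paragraph-styles/fig02.png"],
    "saving/save-as": ["save-as/fig01.png"],
}
corrupted = mismatch_figures(demo)

if __name__ == "__main__":
    for topic, figs in corrupted.items():
        print(topic, "->", figs)
```

If the 8.3-point advantage survives this swap, the figures are probably not doing the work the paper attributes to them; if it collapses to or below the text-only score, the attribution to visual content gains support.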

Figures

Figures reproduced from arXiv: 2606.18448 by Jacob Andreas, Jiabao Ji, Li An, Qiucheng Wu, Shiyu Chang, Yang Zhang, Yujian Liu, Ziyan Jiang.

Figure 1: A text-only skill struggles to describe the …
Figure 2: The two-stage VISUALSKILL construction pipeline. Stage 1 parses the authored documentation into a topic hierarchy, extracting per-topic text bodies and the vendor-drawn figures shipped with the manual. Stage 2 drives the live application with an LLM-controlled explorer in two sub-passes: a free explorer that partitions the idle window and dispatches a worker per region, and a trajectory-targeted explorer …
Figure 3: File layout of the Writer skill (root folder …
Figure 4: Representative excerpt of the central SKILL.md (verbatim from the Stage 2 Writer skill; markdown link syntax simplified to plain titles). Each item names one topic and gives the when to use criterion that the agent matches against its current task before invoking load_topic (Section 2.2). The agent reads only this index up front; per-topic content is fetched on demand. S txt spells the same visual informa…
Figure 5: VISUALSKILL and its text-only control on topic t = formatting-text/character-formatting. Both panels are real guide.md bodies and were produced in the same LLM call from the same source context (Section 2.3); they cover the same procedural content but their text is not word-for-word identical. Left (S txt): the text-only control carries no figure (F txt t = ∅); the second paragraph absorbs the layout of th…
Figure 01: fig01.png
Figure 6: The literal tool-result structure for one topic, in both variants.
Figure 7: One full agent turn end-to-end. The agent (i) scans the …
Figure 01: fig01.png • vendor screenshot extracted from PDF page 85
Figure 02: fig02.png • vendor screenshot extracted from PDF page 88
Figure 9: Two of the worker’s twelve captures for region …
Figure 10: Initial state of exam_paper_formatting. The three target headers (“Part A: Multiple Choice (40 points)” and the two parallel headers below) are plain body text. The paragraph-style dropdown in the top-left of the formatting toolbar reads “Default Paragraph Style”. The sub-goal is to change those three headers to “Heading 2”. … ∼ (75,118) in 1920×1080) and a separate ▼ arrow (right, ∼ (196,156)). Clicking th…
Figure 11: Worker capture, step 4: the ▼ arrow has been clicked and the style menu is open. Every paragraph style is listed and rendered in its own font (Title large, Heading 1 large bold, Heading 2 large, . . . ). This is the figure that lands in the patched skill as the primary reference
Figure 12: Worker capture, step 38: after selecting …
Figure 13: Case (i), Save-button mis-location on test …
Original abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VISUALSKILL, a hierarchical multimodal skill library for computer-use agents (CUAs) that organizes per-topic files containing both text and figures, fetched on demand via a load_topic tool. Skills are built via a two-stage pipeline (authored documentation plus live-application UI exploration). On CUA-World and OSExpert-Eval, a Claude-based agent achieves 0.456 average score (+15.3 over no-skill baseline of 0.303; +8.3 over a matched text-only skill of 0.373), with the gain attributed to visual figures aiding UI element identification and post-action state verification. Code is released.

Significance. If the modality comparison is robust, the work supplies direct evidence that retaining visual artifacts in reusable skills improves long-horizon GUI agent performance beyond text-only representations, with implications for skill libraries in agentic systems. The matched text-only control and open code are strengths that facilitate verification.

major comments (2)
  1. [Construction pipeline (abstract and methods describing skill construction)] The two-stage pipeline (authored documentation combined with live-application UI exploration) is presented as the source of the skill artifacts, yet the manuscript reports no validation step, human review, or automated check confirming that captured UI screenshots match current application state, are task-relevant, or add non-redundant information beyond what can be verbalized. This assumption is load-bearing for the central claim that the +8.3 point gain (0.456 vs 0.373) can be attributed to modality rather than figure quality or noise.
  2. [Evaluation section (benchmarks and results)] The empirical comparison lacks reported details on prompt variation, figure quality controls, or error analysis that would establish robustness of the 0.456 vs 0.373 difference; the abstract notes a matched text-only control but full methods, dataset details, and per-task breakdowns are not provided to verify the result.
minor comments (2)
  1. [VISUALSKILL representation] The hierarchical index and load_topic MCP tool are described at a high level; a concrete example of a per-topic file (text + figures) would clarify consumption by the agent.
  2. [Construction pipeline] No discussion of potential staleness in live-exploration figures or how the pipeline handles application updates.
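
The first major comment asks for an automated check on the figures. A full relevance or state-fidelity check needs the live application, but a much weaker integrity check is cheap: confirm that every figure a guide refers to exists, and that no figure file sits unreferenced. The sketch below does only that. It assumes guides embed figures with Markdown image syntax (`![alt](file)`), which the page does not confirm, and `check_figure_refs` is a hypothetical helper, not part of the released code.

```python
import re

# Matches Markdown image references such as ![style menu](fig01.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def check_figure_refs(guide_text: str, figure_files: set) -> dict:
    """Compare figures referenced in a guide with the files actually present.

    Returns two sorted lists:
      missing      - referenced in the text but no such file
      unreferenced - file present but never mentioned in the text
    """
    referenced = set(IMAGE_REF.findall(guide_text))
    return {
        "missing": sorted(referenced - figure_files),
        "unreferenced": sorted(figure_files - referenced),
    }


# Illustrative input only; the guide text and file names are made up.
guide = (
    "Open the style menu ![style menu](fig01.png) and pick Heading 2. "
    "The result should look like ![after](fig03.png)."
)
report = check_figure_refs(guide, {"fig01.png", "fig02.png"})

if __name__ == "__main__":
    print(report)
```

This catches broken or orphaned figures but says nothing about whether a screenshot still matches the current application, which is the staleness concern in the second minor comment.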

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the presentation of the construction pipeline and evaluation details without altering the core claims.

Point-by-point responses
  1. Referee: [Construction pipeline (abstract and methods describing skill construction)] The two-stage pipeline (authored documentation combined with live-application UI exploration) is presented as the source of the skill artifacts, yet the manuscript reports no validation step, human review, or automated check confirming that captured UI screenshots match current application state, are task-relevant, or add non-redundant information beyond what can be verbalized. This assumption is load-bearing for the central claim that the +8.3 point gain (0.456 vs 0.373) can be attributed to modality rather than figure quality or noise.

    Authors: We agree that the absence of an explicit validation description leaves the attribution open to the concern raised. The live UI exploration stage captures current application states by design, and the authored documentation stage ensures topical relevance, but these steps were not formally documented as quality controls. In the revision we will add a dedicated subsection describing the quality assurance process, including human review of a random sample of skills for state fidelity, task relevance, and non-redundancy relative to text, plus simple automated checks for image-text alignment. This addition will make the load-bearing assumption explicit and verifiable. revision: yes

  2. Referee: [Evaluation section (benchmarks and results)] The empirical comparison lacks reported details on prompt variation, figure quality controls, or error analysis that would establish robustness of the 0.456 vs 0.373 difference; the abstract notes a matched text-only control but full methods, dataset details, and per-task breakdowns are not provided to verify the result.

    Authors: The matched text-only control is constructed from identical source content and differs only in modality, as stated in the methods; this design isolates the visual contribution. We nevertheless concur that additional robustness information is warranted. The revision will expand the evaluation section with (i) results across multiple prompt templates, (ii) figure quality controls applied during skill construction, (iii) a concise error analysis categorizing failure modes, and (iv) per-task score breakdowns for both benchmarks. Dataset construction details and the full per-topic skill inventory will be moved to an appendix. revision: yes
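
Once per-task breakdowns are available, the robustness of the 0.456 vs 0.373 difference can be examined with a paired comparison, since both skill variants are run on the same tasks. The sketch below uses a paired bootstrap over per-task score differences. This is a technique we are suggesting, not one the paper reports, and the per-task scores in the example are made-up placeholders, not results from CUA-World or OSExpert-Eval.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval.

    a and b are per-task scores for two conditions on the same tasks,
    listed in the same task order.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample tasks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# PLACEHOLDER per-task scores, invented purely to exercise the function.
visual_scores = [0.5, 0.6, 0.4, 0.7]
text_scores = [0.4, 0.4, 0.3, 0.5]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(visual_scores, text_scores)

if __name__ == "__main__":
    print(mean_diff, ci_low, ci_high)
```

An interval that excludes zero on the real per-task scores would support the claim that the gain is not driven by a handful of tasks; an interval that straddles zero would indicate the averaged difference is fragile.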

Circularity Check

0 steps flagged

No circularity: empirical comparison on fixed benchmarks with matched control

full rationale

The paper reports an empirical evaluation of VISUALSKILL on CUA-World and OSExpert-Eval benchmarks. The central claim is a measured performance lift (0.456 vs 0.373) when retaining visual figures versus a matched text-only condition generated from identical source content. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The two-stage pipeline is a construction method whose output quality is an external assumption, not a self-referential quantity that forces the reported delta. The result is therefore independent of any definitional or fitted reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The result rests on the empirical claim that visual figures add usable information beyond text; no free parameters are introduced, and the only invented entity is the VISUALSKILL format itself.

axioms (1)
  • domain assumption: Standard benchmark evaluation protocols for computer-use agents are sufficient to measure skill utility.
    The paper reports average scores on CUA-World and OSExpert-Eval without additional validation of benchmark coverage.
invented entities (1)
  • VISUALSKILL hierarchical multimodal skill (no independent evidence)
    purpose: Central index over per-topic files containing text and figures, consumed via load_topic tool
    New skill artifact format proposed in the paper; no independent evidence outside the reported experiments.

pith-pipeline@v0.9.1-grok · 5813 in / 1259 out tokens · 35888 ms · 2026-06-27T00:31:15.158617+00:00 · methodology

