Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

Bolin Ding; Feijie Wu; Jiaheng Lu; Mosharaf Chowdhury; Shiqi He; Xinyu Ma; Yaliang Li; Yue Cui

arxiv: 2606.17645 · v1 · pith:Y3MGW7ZPnew · submitted 2026-06-16 · 💻 cs.AI · cs.CL · cs.LG

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

Pith reviewed 2026-06-27 01:09 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords: web agents, LLM agents, skill transfer, transferable interaction patterns, layout similarity, WebArena, Mind2Web, skill reuse

The pith

SkillMigrator reuses web skills across sites by matching layout structure rather than instructions or domains, reducing the average LLM-action count on successful trajectories by 8-10% at matched success rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that storing web skills as transferable interaction patterns (TIPs) paired with structural sketches of page layouts enables reliable retrieval and reuse on entirely new sites. Retrieval uses layout similarity while references are grounded against the current accessibility snapshot, keeping the rest of the agent stack unchanged. This produces an 8-10% drop in average LLM actions on successful trajectories for WebArena and Mind2Web benchmarks without lowering success rate. A reader would care because fewer model calls per task directly cut latency and cost for deployed agents. The central shift is from instruction or metadata triggers to layout-based matching.

Core claim

SkillMigrator induces skills from trajectories and stores each as a TIP consisting of the skill plus a structural sketch of the page at induction time. At test time the system retrieves TIPs whose sketches match the current page layout and grounds the skill references on the live snapshot. The approach keeps accessibility snapshots and primitive tool calling fixed. On both WebArena and Mind2Web it achieves an 8-10% reduction in LLM-action count on successful trajectories while holding success rate constant.

What carries the argument

Transferable Interaction Pattern (TIP): a skill stored together with a structural sketch of the page layout at induction time, enabling layout-similarity retrieval and live-page reference grounding.
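To make the TIP idea concrete, here is a minimal Python sketch. The abstract does not say how sketches are built or compared, so everything in it is an assumption made for illustration: a sketch is taken to be the multiset of (depth, role) pairs in the accessibility tree, and similarity is a multiset Jaccard score. The names `TIP`, `sketch_of`, `layout_similarity`, and `retrieve` are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass

# A node of an accessibility snapshot: (role, children). Visible text and
# element IDs are left out so the sketch depends on structure only.
Node = tuple[str, list]


def sketch_of(node: Node, depth: int = 0) -> Counter:
    """Hypothetical structural sketch: multiset of (depth, role) pairs."""
    role, children = node
    sketch = Counter({(depth, role): 1})
    for child in children:
        sketch += sketch_of(child, depth + 1)
    return sketch


def layout_similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity in [0, 1] (an assumed stand-in metric)."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0


@dataclass
class TIP:
    """A skill name paired with the sketch of the page it was induced on."""
    skill: str
    sketch: Counter


def retrieve(library: list[TIP], page: Node) -> tuple[TIP, float]:
    """Return the stored TIP whose sketch best matches the live page."""
    live = sketch_of(page)
    best = max(library, key=lambda tip: layout_similarity(tip.sketch, live))
    return best, layout_similarity(best.sketch, live)


# Two forms from different sites with the same shape, plus a search page.
forum_form = ("form", [("textbox", []), ("textbox", []), ("button", [])])
shop_form = ("form", [("textbox", []), ("textbox", []), ("button", [])])
search_page = ("main", [("searchbox", []), ("list", [("link", []), ("link", [])])])

library = [
    TIP("fill_and_submit", sketch_of(forum_form)),
    TIP("open_search_result", sketch_of(search_page)),
]
best, score = retrieve(library, shop_form)
print(best.skill, score)
```

Because the toy sketch ignores labels and IDs, the form induced on the forum matches the shop form exactly. A real sketch would need to tolerate small structural differences, which is the load-bearing premise discussed further down.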

If this is right

Skills induced on one set of sites become usable on held-out sites without retraining or domain metadata.
Fewer LLM actions per successful trajectory directly reduces the number of model completions required.
Layout matching works even when element IDs and visible text differ between induction and test pages.
The method integrates with existing accessibility-based observations and fixed tool sets without changing the agent loop.
Maintaining success rate while shortening trajectories implies lower average horizon length on the benchmarks.
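The headline metric can be sketched in a few lines of Python. The trajectory data below is made up for illustration and is not taken from the paper; the function names are hypothetical. The point is only to show what "average LLM-action count on successful trajectories at matched success rate" computes.

```python
from statistics import mean

# Each trajectory is (succeeded, number_of_llm_actions).
Trajectory = tuple[bool, int]


def success_rate(trajectories: list[Trajectory]) -> float:
    """Fraction of trajectories that completed the task."""
    return sum(ok for ok, _ in trajectories) / len(trajectories)


def avg_actions_on_success(trajectories: list[Trajectory]) -> float:
    """Mean LLM-action count over successful trajectories only."""
    return mean(n for ok, n in trajectories if ok)


def relative_reduction(baseline: float, ours: float) -> float:
    """Relative drop in action count, e.g. 0.10 for a 10% reduction."""
    return (baseline - ours) / baseline


# Invented numbers: same tasks, same outcomes, shorter successful runs.
baseline = [(True, 10), (True, 12), (False, 20), (True, 8)]
with_skills = [(True, 9), (True, 11), (False, 20), (True, 7)]

print(success_rate(baseline), success_rate(with_skills))
print(relative_reduction(avg_actions_on_success(baseline),
                         avg_actions_on_success(with_skills)))
```

Failed trajectories are excluded from the average, so the metric says nothing about how much budget failures consume. That is one reason the success rates have to be matched before the action counts are compared.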

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If sketches remain stable under minor UI changes, the same TIP library could support ongoing reuse within evolving sites over time.
Layout-based indexing could apply to other structured interfaces that expose hierarchical or spatial layouts beyond the web.
Combining layout retrieval with instruction similarity might raise reuse rates above what either signal achieves alone.
The result suggests structural invariance can be a stronger transfer signal than semantic similarity for many web tasks.

Load-bearing premise

Structural sketches of page layouts at skill induction time stay similar enough across different sites to support accurate retrieval and correct reference grounding even when content and element identities change.

What would settle it

A collection of held-out sites where equivalent skills produce structurally dissimilar sketches, resulting in retrieval failures or grounding errors and no net reduction in action count.

Figures

Figures reproduced from arXiv: 2606.17645 by Bolin Ding, Feijie Wu, Jiaheng Lu, Mosharaf Chowdhury, Shiqi He, Xinyu Ma, Yaliang Li, Yue Cui.

**Figure 1.** Cross-domain skill reuse motivates SkillMigrator. Three websites drawn from very different domains, Shopify (e-commerce), GitLab (developer tools), and Postmill (online forum), use different page layouts, field vocabularies, and submit-button labels. Yet the three subtasks reduce to the same programmatic pattern: fill a few labelled inputs, then click a single submit button. Same-colour fields (title-like, …

**Figure 2.** Motivating comparison of ASI, SkillWeaver, and PolySkill on a cumulative Mind2Web …

**Figure 3.** End-to-end cross-domain example. A skill induced from a Postmill …

**Figure 4.** Runtime control flow. Passing the gate β routes the subtask through skill mode (Stage A then Stage B). Failing falls back to a single primitive step from πθ.

3.2 Slot Binding and Execution. Once k⋆ is chosen, the agent must associate each slot ξ ∈ Φk⋆ with a concrete value string before binding it to a control on the page. We follow the cross-domain slot-filling view of Liu et al. [18], Wang et al. [19]: a…

**Figure 5.** Success rate against average LLM-action count

**Figure 6.** Sensitivity to the mixing weight α, the gate threshold β, and the LLM backbone on WebArena. Blue (left axis) is success rate. Orange (right axis) is the average LLM-action count N̄.

… same-wording, different-structure pages. On Mind2Web cross-domain the same ablation costs 3.6 points (59.4→55.8), and no synonyms costs another 2.4 (59.4→57.0): paraphrase coverage matters most when the target site rename…
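The control flow in the Figure 4 caption can be sketched as a single gated step. This is an illustration under stated assumptions, not the paper's implementation: the captions name a mixing weight α and a gate threshold β but do not say what α mixes, so the sketch assumes it blends a structural score with a label-text score. `gated_step` and its arguments are hypothetical names.

```python
from typing import Callable


def gated_step(
    structural: float,
    textual: float,
    alpha: float,
    beta: float,
    run_skill: Callable[[], list[str]],
    primitive_step: Callable[[], str],
) -> list[str]:
    """Route one subtask through skill mode or a single primitive step."""
    # ASSUMPTION: alpha blends a structural and a textual similarity score.
    score = alpha * structural + (1.0 - alpha) * textual
    if score >= beta:
        # Gate passed: skill mode (Stage A then Stage B in Figure 4).
        return run_skill()
    # Gate failed: fall back to one primitive action from the policy.
    return [primitive_step()]


def skill() -> list[str]:
    """Toy skill that expands to three primitive actions."""
    return ["fill(title)", "fill(body)", "click(submit)"]


def primitive() -> str:
    """Toy policy that emits one primitive action."""
    return "click(link)"


passed = gated_step(0.9, 0.5, alpha=0.7, beta=0.6,
                    run_skill=skill, primitive_step=primitive)
failed = gated_step(0.3, 0.5, alpha=0.7, beta=0.6,
                    run_skill=skill, primitive_step=primitive)
print(passed)
print(failed)
```

Raising β makes the agent more conservative: fewer skill invocations, fewer wrong retrievals, and less action-count savings. That trade-off is presumably what the sensitivity plot in Figure 6 explores.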

Original abstract

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.
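The abstract's two ideas, that one skill call replaces several primitives and that references are grounded on the live page, can be illustrated with a small Python sketch. The grounding rule used here (pick the n-th element of a given accessibility role, in page order) is an assumption chosen for brevity, and `Step`, `FILL_AND_SUBMIT`, and `ground` are hypothetical names; the abstract does not describe the actual procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    action: str                 # primitive name, e.g. "fill" or "click"
    role: str                   # accessibility role the step targets
    ordinal: int                # which element of that role, in page order
    slot: Optional[str] = None  # name of the value to type, if any


# Hypothetical skill: fill two text inputs, then click one button.
FILL_AND_SUBMIT = [
    Step("fill", "textbox", 0, slot="title"),
    Step("fill", "textbox", 1, slot="body"),
    Step("click", "button", 0),
]


def ground(skill: list[Step], snapshot: list[tuple[str, str]],
           values: dict[str, str]) -> list[str]:
    """Resolve each step to a live element reference and emit primitives."""
    by_role: dict[str, list[str]] = {}
    for ref, role in snapshot:
        by_role.setdefault(role, []).append(ref)
    calls = []
    for step in skill:
        refs = by_role.get(step.role, [])
        if step.ordinal >= len(refs):
            raise LookupError(f"no {step.role} #{step.ordinal} on this page")
        ref = refs[step.ordinal]
        arg = f", {values[step.slot]!r}" if step.slot else ""
        calls.append(f"{step.action}({ref}{arg})")
    return calls


# Live snapshot of an unseen site: (reference, role) pairs in page order.
live = [("e7", "heading"), ("e12", "textbox"),
        ("e15", "textbox"), ("e21", "button")]
calls = ground(FILL_AND_SUBMIT, live,
               {"title": "Bug report", "body": "Steps to reproduce"})
print(calls)
```

In this toy setting one skill invocation yields three primitive actions, so the policy model is consulted once instead of three times. A grounding failure surfaces as a `LookupError`, at which point a real agent would presumably fall back to primitive steps.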

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Using layout sketches for cross-site skill retrieval is the genuinely new piece, but the abstract gives no way to check whether it drives the claimed 8-10% action reduction.

The paper's main move is to store web skills as TIPs that bundle the skill with a structural sketch of the page at induction time, then retrieve by layout similarity on new sites and ground the references on the live page. This is meant to beat instruction or metadata matching for reuse on held-out sites in WebArena and Mind2Web.

It frames the real bottleneck clearly: low-level actions make horizons expensive, and prior skill libraries do not transfer enough. The reported result is an 8-10% drop in average LLM actions on successful trajectories at matched success rate. The rest of the stack (accessibility snapshots, tool calling) is standard.

The soft spot is that nothing in the abstract shows how the sketches are built, how similarity is measured, what retrieval precision looks like, or any ablation that removes the layout matcher. Without those, it is impossible to tell whether the gain comes from the claimed mechanism or from simply having more skills or different prompting. The assumption that layout structures stay stable and discriminative across content and element changes is load-bearing but untested here.

This is for people working on practical LLM web agents who want to cut step counts on current benchmarks. A reader already following WebArena or Mind2Web work could extract the TIP idea and test it themselves.

If the full paper supplies the missing controls, retrieval numbers, and ablations, it deserves referee time. Based on the abstract alone, the central claim cannot be evaluated yet.

Referee Report

2 major / 1 minor

Summary. The paper presents SkillMigrator, which induces web skills as Transferable Interaction Patterns (TIPs) stored with structural sketches of page layouts at induction time. At test time, TIPs are retrieved by layout similarity and grounded on the live page using accessibility snapshots and standard tool calling; the central claim is an 8-10% reduction in average LLM-action count on successful trajectories at matched success rate on WebArena and Mind2Web relative to prior skill libraries triggered by instruction similarity or site metadata.

Significance. If the reduction is robust and attributable to the layout-based mechanism, the work would offer a concrete improvement in efficiency for LLM web agents by enabling higher skill reuse across held-out sites, directly addressing the low reuse rates noted for existing approaches.

major comments (2)

[Abstract] The 8-10% action-count reduction at matched success rate is stated without experimental details, baselines, variance, or a description of how success-rate matching was performed, so the central empirical result cannot be assessed from the provided text.
[Method (TIP retrieval)] The claim that gains arise specifically from layout-similarity retrieval (rather than simply having more skills or different prompting) rests on the untested assumption that structural sketches are invariant and discriminative across content and element-ID changes; no retrieval-precision metrics, false-positive rates, or ablation removing the layout matcher are supplied to support this.

minor comments (1)

[Abstract] 'State-of-the-art approaches' are referenced but the specific baselines are not named.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve the presentation of results and strengthen the supporting evidence for the core mechanism.

Referee: [Abstract] The 8-10% action-count reduction at matched success rate is stated without experimental details, baselines, variance, or a description of how success-rate matching was performed, so the central empirical result cannot be assessed from the provided text.

Authors: We agree that the abstract would benefit from additional context. In the revised manuscript we will expand the abstract to briefly specify the baselines (prior skill libraries using instruction similarity or site metadata), the two benchmarks (WebArena and Mind2Web), the success-rate matching procedure (reporting average action counts on successful trajectories at comparable overall success rates), and note that variance is reported in the main results tables.

Revision: yes

Referee: [Method (TIP retrieval)] The claim that gains arise specifically from layout-similarity retrieval (rather than simply having more skills or different prompting) rests on the untested assumption that structural sketches are invariant and discriminative across content and element-ID changes; no retrieval-precision metrics, false-positive rates, or ablation removing the layout matcher are supplied to support this.

Authors: The reported gains are obtained by comparing SkillMigrator against prior skill libraries that trigger on instruction similarity or site metadata; the performance difference is therefore attributable to the change in retrieval mechanism. We nevertheless recognize that explicit diagnostics would strengthen the argument. In the revision we will add retrieval-precision and false-positive metrics for the layout matcher together with an ablation that disables layout similarity while keeping the rest of the skill library and prompting unchanged.

Revision: yes
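The rebuttal promises retrieval-precision and false-positive metrics without defining them. One conventional reading, offered here as an assumption rather than the authors' definition, labels each retrieval event with whether a TIP was retrieved and whether reuse was actually applicable on that page, then computes the usual confusion-matrix ratios. The function name and the event data are hypothetical.

```python
def retrieval_diagnostics(events: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute precision and false-positive rate for a layout matcher.

    Each event is (retrieved, applicable): whether the matcher returned a
    TIP, and whether a stored skill really applied to that page.
    """
    tp = sum(r and a for r, a in events)            # retrieved, applicable
    fp = sum(r and not a for r, a in events)        # retrieved, not applicable
    tn = sum((not r) and (not a) for r, a in events)
    precision = tp / (tp + fp) if tp + fp else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "false_positive_rate": false_positive_rate}


# Invented labels for six page visits.
events = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]
diagnostics = retrieval_diagnostics(events)
print(diagnostics)
```

Reporting both numbers for the full system and for the promised ablation with layout similarity disabled would show whether the action-count gain tracks retrieval quality or arises elsewhere in the pipeline.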

Circularity Check

0 steps flagged

No circularity; empirical benchmark result with no derivation chain

The paper reports an empirical performance gain (8-10% reduction in LLM-action count at matched success rate on WebArena and Mind2Web) from a system that retrieves TIPs via layout similarity. No equations, fitted parameters, or mathematical derivations appear in the abstract or described claims. The central result is presented as a benchmark outcome rather than a quantity derived from inputs by construction. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are evident. The derivation chain is therefore self-contained as an engineering evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; review limited to abstract only.

pith-pipeline@v0.9.1-grok · 5813 in / 1012 out tokens · 35191 ms · 2026-06-27T01:09:07.723385+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 1 canonical work pages

[1] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.

[2] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023.

[3] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.

[4] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.

[5] Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, and Huzefa Rangwala. Agentoccam: A simple yet strong baseline for llm-based web agents. arXiv preprint arXiv:2410.13825, 2024.

[6] Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024.

[7] Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried. Inducing programmatic skills for agentic tasks. arXiv preprint arXiv:2504.06821, 2025.

[8] Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079, 2025.

[9] Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, and Ran Xu. WALT: Web agents that learn tools. arXiv preprint arXiv:2510.01524, 2025.

[10] Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, et al. Webxskill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318, 2026.

[11] Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, and Shi Jin. Contractskill: Repairable contract-based skills for multimodal web agents. arXiv preprint arXiv:2603.20340, 2026.

[12] Simon Yu, Gang Li, Weiyan Shi, and Peng Qi. PolySkill: Learning generalizable skills through polymorphic abstraction for continual learning. In International Conference on Learning Representations, 2026.

[13] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024.

[14] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, 2024.

[15] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.

[16] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.

[17] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019.

[18] Zihan Liu, Genta Indra Winata, Peng Xu, and Pascale Fung. Coach: A coarse-to-fine approach for cross-domain slot filling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 19–25, 2020. URL https://aclanthology.org/2020.acl-main.3/

[19] Liwen Wang, Xuefeng Li, Jiachi Liu, Keqing He, Yuanmeng Yan, and Weiran Xu. Bridge to target domain by prototypical contrastive learning and label confusion: Re-explore zero-shot learning for slot filling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9474–9480, 2021. URL https://aclanthology.org/...

[20] Kuo-Chung Tai. The tree-to-tree correction problem. Journal of the ACM, 26(3):422–433, 1979.

[21] Mateusz Pawlik and Nikolaus Augsten. Tree edit distance: Robust and memory-efficient. Information Systems, 56:157–173, 2016.

[22] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.

[23] James Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.

[24] Ankur Bapna, Gokhan Tür, Dilek Hakkani-Tür, and Larry Heck. Towards zero-shot frame semantic parsing for domain scaling. In Interspeech 2017, pages 2476–2480, 2017. doi: 10.21437/Interspeech.2017-518

[25] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.

[26] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.

[27] Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. arXiv preprint arXiv:2504.01382, 2025.

[28] Dawei Gao, Zitao Li, Yuexiang Xie, Weirui Kuang, Liuyi Yao, Bingchen Qian, Zhijian Ma, Yue Cui, Haohao Luo, Shen Li, Lu Yi, Yi Yu, Shiqi He, Zhiling Luo, Wenmeng Zhou, Zhicheng Zhang, Xuguang He, Ziqian Chen, Weikai Liao, Farruh Isakulovich Kushnazarov, Yaliang Li, Bolin Ding, and Jingren Zhou. Agentscope 1.0: A developer-centric framework for building agentic applications. CoRR, abs/2508.16279, 2025.

[29] Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, and Mosharaf Chowdhury. Branch-and-browse: Efficient and controllable web exploration with tree-structured reasoning and action memory. arXiv preprint arXiv:2510.19838, 2025.

[30] Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Odysseys: Benchmarking web agents on realistic long horizon tasks. arXiv preprint arXiv:2604.24964, 2026.

[31] Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, et al. Molmoweb: Open visual web agent and open data for the open web. arXiv preprint arXiv:2604.08516, 2026.
