DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Hao Chen; Haoyuan Chen; Jian Huang; Maolin He; Rong Zhou; Sihang Zhou; Siwei Wang; Xirui Liu; Yanning Hou

arxiv: 2605.23939 · v1 · pith:BCGEEOVCnew · submitted 2026-04-28 · 💻 cs.AI · cs.LG

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Xirui Liu , Sihang Zhou , Yanning Hou , Rong Zhou , Haoyuan Chen , Maolin He , Siwei Wang , Hao Chen

show 1 more author

Jian Huang

This is my paper

Pith reviewed 2026-07-01 08:59 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords web agentsskill modelingcontinual learningreasoning skillsinteraction skillstask decompositionweb navigationdual-level framework

0 comments

The pith

Web agents improve by separating transferable reasoning skills in natural language from page-specific interaction skills in code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web agents must handle both abstract task logic that works across sites and concrete page manipulations that do not. Storing both kinds of experience in one form creates a tradeoff: abstract versions lose executability while concrete versions fail to generalize. DRIVE splits historical experience into natural language reasoning skills for task decomposition and programmatic interaction skills for element manipulation. A scene-aware mechanism retrieves and combines the right skills for each new page, while skill-level reflection updates the libraries when failures occur. Experiments on five WebArena domains show this separation raises average task success from the skill-free baseline.

Core claim

DRIVE models skills at two levels by extracting natural language reasoning skills that capture transferable task logic such as searching routes before booking and programmatic interaction skills that ground those actions to specific page operations, then coordinates them through a scene-aware retrieval process that selects skills based on task semantics and current page context, allowing agents to accumulate capabilities across domains without entanglement of abstract and concrete knowledge.

What carries the argument

Dual-level skill modeling framework that separates natural language reasoning skills from programmatic interaction skills, coordinated by a scene-aware mechanism and refined through skill-level reflection.

If this is right

Reasoning skills can be reused across sites with different layouts because they remain in natural language.
Interaction skills can be updated independently when page structures change without rewriting task logic.
Skill-level reflection isolates whether failures stem from missing reasoning or missing interaction knowledge.
Agents accumulate capabilities over time by expanding the two libraries separately rather than retraining end-to-end.
Ablations confirm the two skill types deliver distinct complementary gains rather than redundant ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation might reduce interference when an agent must switch between multiple websites in one session.
It could be extended to other agent settings where high-level plans must map to low-level actuators that change over time.
Library growth might eventually require explicit conflict detection between reasoning skills acquired from different task families.

Load-bearing premise

Historical experience can be cleanly separated into natural language reasoning skills and programmatic interaction skills, and a scene-aware mechanism can retrieve and coordinate them reliably on unseen websites without new failure modes.

What would settle it

Measure task success rate on a held-out set of websites that introduce interaction patterns absent from training data; if the separated model performs no better than the uniform baseline, the separation provides no net gain.

Figures

Figures reproduced from arXiv: 2605.23939 by Hao Chen, Haoyuan Chen, Jian Huang, Maolin He, Rong Zhou, Sihang Zhou, Siwei Wang, Xirui Liu, Yanning Hou.

**Figure 2.** Figure 2: Continuous capability accumulation in DRIVE. As the number of training trajectories increases [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗

**Figure 3.** Figure 3: Validation of failure attribution. (a) Confusion matrix between human-adjudicated and LM-based [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗

**Figure 4.** Figure 4: Case study resolving an interaction error. The baseline agent fails due to brittle execution on dynamic interface elements (e.g., state and zip-code field dependencies). In contrast, DRIVE invokes a programmatic interaction skill that encodes a robust procedure for structured form completion, successfully bypassing the page-specific constraints. AgentOccam Task:Search for "switch accessories" Initial Progr… view at source ↗

**Figure 5.** Figure 5: Case study resolving a reasoning error. The baseline agent exhibits flawed stopping-condition judgment, unnecessarily clicking into a product page when the search results already satisfy the query. DRIVE leverages a natural-language reasoning skill to correctly frame the task intent and halt execution at the appropriate state. structured form completion under page constraints [PITH_FULL_IMAGE:figures/full… view at source ↗

read the original abstract

Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DRIVE's split of skills into transferable NL reasoning and page-specific programmatic interactions is a direct response to the uniform-storage problem, but the empirical case rests on same-domain tests and lacks detail on whether the separation stays clean.

read the letter

The main thing to know is that DRIVE factors agent experience into natural language reasoning skills for abstract task logic and programmatic interaction skills for executable page actions, then uses a scene-aware retriever to pull and coordinate the right pair plus a reflection step to refine the library.

The paper does a clean job naming the core dilemma: uniform storage either strips executability from abstract knowledge or prevents generalization when page details change. The ablations are presented as showing that the two levels supply complementary gains rather than one subsuming the other, which supports keeping them separate.

The reported outcome is a 52.8% average success rate on WebArena tasks, 7.3 points above the no-skill baseline. That number is concrete and the framework includes an explicit mechanism for expanding the skill set over time.

The soft spot is the load-bearing assumption that the two skill types can be extracted and retrieved without introducing coordination errors or losing necessary context. Many WebArena tasks embed reasoning inside element selection, so a scene descriptor that misses layout cues could easily mismatch a reasoning skill with the wrong interaction skill. All the numbers come from the same five domains used to build the skill library, so there is no direct check on behavior for truly new sites. The abstract also gives no error bars, significance tests, or description of how the baseline was implemented or how skills were initially extracted.

This is for people working on continual learning for web agents. It deserves a serious referee because the problem is real and the proposed split is a targeted engineering response, even if the current evidence is preliminary.

I would send it to review and expect questions on out-of-domain testing and the extraction pipeline.

Referee Report

2 major / 1 minor

Summary. The paper proposes DRIVE, a dual-level skill modeling framework for web agents under continual learning. It separates historical experience into natural language reasoning skills (abstract, transferable task logic such as flight booking decomposition) and programmatic interaction skills (page-grounded executable operations). A scene-aware coordination mechanism retrieves and invokes these skills, supplemented by skill-level reflection for targeted library updates. On five WebArena domains, DRIVE reports 52.8% average task success rate, exceeding the skill-free baseline by 7.3 percentage points, with ablations indicating that reasoning and interaction skills yield distinct, complementary benefits.

Significance. If the separation and coordination claims hold under proper controls, the work would advance continual learning for web agents by addressing the entanglement of abstract and concrete knowledge, enabling better cross-domain accumulation. The concrete performance number, baseline comparison, and ablation outcomes on complementary benefits constitute a falsifiable empirical contribution that could be built upon if extended to out-of-domain validation.

major comments (2)

[Abstract] Abstract (experiments paragraph): the central claim that reasoning and interaction skills can be cleanly separated without introducing new failure modes on unseen sites is load-bearing, yet all reported results and ablations operate inside the same five WebArena domains used for skill extraction; this does not test transfer to sites with novel page structures where scene descriptors or retrieval might mismatch.
[Abstract] Abstract: the reported 52.8% success rate and 7.3 pp improvement over baseline are presented without any reference to the number of tasks, number of trials per task, statistical significance tests, error bars, or exact computation method, which prevents verification of the empirical claim's reliability.

minor comments (1)

[Abstract] The abstract could more explicitly reference the full methods and experimental sections to allow readers to locate details on the scene-aware retriever and reflection mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will incorporate revisions to improve clarity and transparency in the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract (experiments paragraph): the central claim that reasoning and interaction skills can be cleanly separated without introducing new failure modes on unseen sites is load-bearing, yet all reported results and ablations operate inside the same five WebArena domains used for skill extraction; this does not test transfer to sites with novel page structures where scene descriptors or retrieval might mismatch.

Authors: We agree that the current evaluation is limited to the five WebArena domains used for skill extraction and does not include explicit tests on entirely novel sites with unseen page structures. The ablations demonstrate complementary benefits within these domains, but the load-bearing claim about clean separation without new failure modes on unseen sites is not directly supported by out-of-domain results. We will revise the abstract to clarify the evaluation scope, tone down the generalization claim, and add a limitations paragraph discussing potential mismatches in scene descriptors or retrieval for novel sites, along with plans for future out-of-domain validation. revision: yes
Referee: [Abstract] Abstract: the reported 52.8% success rate and 7.3 pp improvement over baseline are presented without any reference to the number of tasks, number of trials per task, statistical significance tests, error bars, or exact computation method, which prevents verification of the empirical claim's reliability.

Authors: We agree that the abstract lacks sufficient experimental details for verification. The full paper (Section 4) specifies evaluation on 250 tasks (50 per domain) with 3 independent trials per task, using the standard WebArena binary success metric averaged across runs. We will update the abstract to include these details, along with error bars and reference to statistical testing where applicable in the results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

full rationale

The paper proposes a dual-level skill separation framework and reports empirical success rates (52.8% vs. baseline) plus ablations on WebArena domains. No equations, fitted parameters, or derivations are present. The separation of reasoning vs. interaction skills is an architectural assumption tested via experiments rather than derived from its own outputs. No self-citation chains, uniqueness theorems, or renamings reduce the central claim to its inputs by construction. The result is self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that reasoning and interaction knowledge are separable without loss; no free parameters, invented physical entities, or additional axioms are visible in the abstract.

axioms (1)

domain assumption Reasoning knowledge is abstract and transferable while interaction knowledge is page-specific and non-transferable
This premise is stated directly in the opening motivation of the abstract and underpins the entire dual-level design.

pith-pipeline@v0.9.1-grok · 5818 in / 1361 out tokens · 31059 ms · 2026-07-01T08:59:16.170208+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 22 canonical work pages · 9 internal anchors

[1]

GPT-4V(ision) is a Generalist Web Agent, if Grounded

B. Zheng, B. Gou, J. Kil, H. Sun, Y. Su, Gpt-4v (ision) is a generalist web agent, if grounded, arXiv preprint arXiv:2401.01614 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, D. Yu, Webvoyager: Buildinganend-to-endwebagentwithlargemultimodalmodels,in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 6864–6890

2024
[3]

H. Lai, X. Liu, I. L. Iong, S. Yao, Y. Chen, P. Shen, H. Yu, H. Zhang, X. Zhang, Y.Dong,etal.,Autowebglm: Alargelanguagemodel-basedwebnavigatingagent, in: Proceedingsofthe30thACMSIGKDDConferenceonKnowledgeDiscovery and Data Mining, 2024, pp. 5295–5306

2024
[4]

C.Liu,Y.Wang,D.Li,X.Wang,Domain-incrementallearningwithoutforgetting based on random vector functional link networks, Pattern recognition 151 (2024) 110430

2024
[5]

Zheng, B

B. Zheng, B. Gou, S. Salisbury, Z. Du, H. Sun, Y. Su, Webolympus: An open platformforwebagentsonlivewebsites,in: Proceedingsofthe2024Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024, pp. 187–197

2024
[6]

Y. Pan, D. Kong, S. Zhou, C. Cui, Y. Leng, B. Jiang, H. Liu, Y. Shang, S. Zhou, T. Wu, et al., Webcanvas: Benchmarking web agents in online environments, 2024, URL https://arxiv.org/abs/2406.12373

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

S. Ye, H. Shi, D. Shih, H. Yun, T. Roosta, T. Shu, Realwebassist: A bench- mark for long-horizon web assistance with real-world users, arXiv preprint arXiv:2504.10445 (2025). 31

work page arXiv 2025
[8]

T. Xue, W. Qi, T. Shi, C. H. Song, B. Gou, D. Song, H. Sun, Y. Su, An il- lusion of progress? assessing the current state of web agents, arXiv preprint arXiv:2504.01382 (2025)

work page arXiv 2025
[9]

D. Lee, J. Lee, K. Kim, J. Tack, J. Shin, Y. W. Teh, K. Lee, Learning to con- textualize web pages for enhanced decision making by llm agents, arXiv preprint arXiv:2503.10689 (2025)

work page arXiv 2025
[10]

B. Gou, R. Wang, B. Zheng, Y. Xie, C. Chang, Y. Shu, H. Sun, Y. Su, Navigating the digital world as humans do: Universal visual grounding for gui agents, arXiv preprint arXiv:2410.05243 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, Y. Cao, React: Syn- ergizing reasoning and acting in language models, in: The eleventh international conference on learning representations, 2022

2022
[12]

Han, J.-w

Y.-n. Han, J.-w. Liu, Adaptive instance similarity embedding for online continual learning, Pattern Recognition 149 (2024) 110238

2024
[13]

A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, G. Huang, Expel: Llm agents are experiential learners, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 19632–19642

2024
[14]

Z.Z.Wang,J.Mao,D.Fried,G.Neubig,Agentworkflowmemory,arXivpreprint arXiv:2409.07429 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Y. Liu, C. Si, K. R. Narasimhan, S. Yao, Contextual experience replay for self- improvement of language agents, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 14179–14198

2025
[16]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srini- vasa, G. Liu, G. Neubig, et al., Skillweaver: Web agents can self-improve by discovering and honing skills, 2025, URL https://arxiv.org/abs/2504.07079. 32

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Y. Zhou, Q. Yang, K. Lin, M. Bai, X. Zhou, Y.-X. Wang, S. Levine, L. E. Li, Proposer-agent-evaluator(pae): Autonomousskilldiscoveryforfoundationmodel internet agents, in: Forty-second International Conference on Machine Learning, 2025

2025
[18]

Prabhu, Y

V. Prabhu, Y. Dai, M. Fernandez, J. Gu, K. Ramakrishnan, Y. Luo, S. Savarese, C. Xiong, J. Li, Z. Chen, et al., Walt: Web agents that learn tools, arXiv preprint arXiv:2510.01524 (2025)

work page arXiv 2025
[19]

Zhong, F

H. Zhong, F. Faisal, L. França, T. Leesatapornwongsa, A. Szekeres, K. Rong, S. Nath, Actionengine: From reactive to programmatic gui agents via state ma- chine memory, arXiv preprint arXiv:2602.20502 (2026)

work page arXiv 2026
[20]

W.Liu,X.-J.Wu,F.Zhu,M.-M.Yu,C.Wang,C.-L.Liu,Classincrementallearn- ing with self-supervised pre-training and prototype learning, Pattern Recognition 157 (2025) 110943

2025
[21]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al., Webarena: A realistic web environment for building autonomous agents, arXiv preprint arXiv:2307.13854 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Z.Wang,Y.Sun,X.Zhang,B.Xu,Z.Yang,H.Lin,Continuallearningwithhigh- orderexperiencereplayfordynamicnetworkembedding,PatternRecognition159 (2025) 111093

2025
[23]

G. Liu, S. Geng, S. Li, H. Cui, S. Zhang, X. Liu, T. Liu, Webcoach: Self-evolving web agents with cross-session memory guidance, arXiv preprint arXiv:2511.12997 (2025)

work page arXiv 2025
[24]

ExpSeek: Self-Triggered Experience Seeking for Web Agents

W. Zhang, X. Zhang, H. Yu, S. Nie, B. Wu, J. Yue, T. Liu, Y. Li, Expseek: Self- triggered experience seeking for web agents, arXiv preprint arXiv:2601.08605 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

CASCADE : Cumulative agentic skill creation through autonomous development and evolution

X. Huang, J. Chen, Y. Fei, Z. Li, P. Schwaller, G. Ceder, Cascade: Cumula- tive agentic skill creation through autonomous development and evolution, arXiv preprint arXiv:2512.23880 (2025). 33

work page arXiv 2025
[26]

J.Qiu,X.Qi,T.Zhang,X.Juan,J.Guo,Y.Lu,Y.Wang,Z.Yao,Q.Ren,X.Jiang, et al., Alita: Generalist agent enabling scalable agentic reasoning with mini- mal predefinition and maximal self-evolution, arXiv preprint arXiv:2505.20286 (2025)

work page arXiv 2025
[27]

Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, W. Zhao, Y. Yang, X. Yang, J. Sun, S.Yao,T.Zhang,W.Xu,J.Tang,Y.Dong,Webrl: Trainingllmwebagentsviaself- evolving online curriculum reinforcement learning (2025).arXiv:2411.02337

work page arXiv 2025
[28]

7920–7939

Z.Wei, W.Yao, Y.Liu, W.Zhang, Q.Lu, L.Qiu, C.Yu, P.Xu, C.Zhang, B.Yin, etal.,Webagent-r1: Trainingwebagentsviaend-to-endmulti-turnreinforcement learning,in: Proceedingsofthe2025ConferenceonEmpiricalMethodsinNatural Language Processing, 2025, pp. 7920–7939

2025
[29]

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

P. Putta, E. Mills, N. Garg, S. Motwani, C. Finn, D. Garg, R. Rafailov, Agent q: Advanced reasoning and learning for autonomous ai agents, arXiv preprint arXiv:2408.07199 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

K. Yang, Y. Liu, S. Chaudhary, R. Fakoor, P. Chaudhari, G. Karypis, H. Rang- wala, Agentoccam: A simple yet strong baseline for llm-based web agents, arXiv preprint arXiv:2410.13825 (2024)

work page arXiv 2024
[31]

Gandhi, G

A. Gandhi, G. Neubig, Go-browse: Training web agents with structured explo- ration, arXiv preprint arXiv:2506.03533 (2025)

work page arXiv 2025
[32]

Sodhi, S

P. Sodhi, S. Branavan, Y. Artzi, R. McDonald, Step: Stacked llm policies for web actions, arXiv preprint arXiv:2310.03720 (2023)

work page arXiv 2023
[33]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H.Wei,etal.,Qwen2.5technicalreport,arXivpreprintarXiv:2412.15115(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

J. Dong, Y. Cong, G. Sun, T. Zhang, Lifelong robotic visual-tactile perception learning, Pattern Recognition 121 (2022) 108176

2022
[35]

J. Yin, X. Zhang, L. Wu, X. Wang, Context-aware prompt learning for test-time vision recognition with frozen vision-language model, Pattern Recognition 162 (2025) 111359. 34

2025
[36]

J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, D. Fried, Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 881–905

2024
[37]

Y. Qian, K. Qian, X. He, L. Chen, J. Zhang, T. Zhang, H. Wei, L. Wang, H. Wu, B. Mao, Zero-permission manipulation: Can we trust large multimodal model powered gui agents?, arXiv preprint arXiv:2601.12349 (2026). 35

work page arXiv 2026

[1] [1]

GPT-4V(ision) is a Generalist Web Agent, if Grounded

B. Zheng, B. Gou, J. Kil, H. Sun, Y. Su, Gpt-4v (ision) is a generalist web agent, if grounded, arXiv preprint arXiv:2401.01614 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, D. Yu, Webvoyager: Buildinganend-to-endwebagentwithlargemultimodalmodels,in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 6864–6890

2024

[3] [3]

H. Lai, X. Liu, I. L. Iong, S. Yao, Y. Chen, P. Shen, H. Yu, H. Zhang, X. Zhang, Y.Dong,etal.,Autowebglm: Alargelanguagemodel-basedwebnavigatingagent, in: Proceedingsofthe30thACMSIGKDDConferenceonKnowledgeDiscovery and Data Mining, 2024, pp. 5295–5306

2024

[4] [4]

C.Liu,Y.Wang,D.Li,X.Wang,Domain-incrementallearningwithoutforgetting based on random vector functional link networks, Pattern recognition 151 (2024) 110430

2024

[5] [5]

Zheng, B

B. Zheng, B. Gou, S. Salisbury, Z. Du, H. Sun, Y. Su, Webolympus: An open platformforwebagentsonlivewebsites,in: Proceedingsofthe2024Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024, pp. 187–197

2024

[6] [6]

Y. Pan, D. Kong, S. Zhou, C. Cui, Y. Leng, B. Jiang, H. Liu, Y. Shang, S. Zhou, T. Wu, et al., Webcanvas: Benchmarking web agents in online environments, 2024, URL https://arxiv.org/abs/2406.12373

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

S. Ye, H. Shi, D. Shih, H. Yun, T. Roosta, T. Shu, Realwebassist: A bench- mark for long-horizon web assistance with real-world users, arXiv preprint arXiv:2504.10445 (2025). 31

work page arXiv 2025

[8] [8]

T. Xue, W. Qi, T. Shi, C. H. Song, B. Gou, D. Song, H. Sun, Y. Su, An il- lusion of progress? assessing the current state of web agents, arXiv preprint arXiv:2504.01382 (2025)

work page arXiv 2025

[9] [9]

D. Lee, J. Lee, K. Kim, J. Tack, J. Shin, Y. W. Teh, K. Lee, Learning to con- textualize web pages for enhanced decision making by llm agents, arXiv preprint arXiv:2503.10689 (2025)

work page arXiv 2025

[10] [10]

B. Gou, R. Wang, B. Zheng, Y. Xie, C. Chang, Y. Shu, H. Sun, Y. Su, Navigating the digital world as humans do: Universal visual grounding for gui agents, arXiv preprint arXiv:2410.05243 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, Y. Cao, React: Syn- ergizing reasoning and acting in language models, in: The eleventh international conference on learning representations, 2022

2022

[12] [12]

Han, J.-w

Y.-n. Han, J.-w. Liu, Adaptive instance similarity embedding for online continual learning, Pattern Recognition 149 (2024) 110238

2024

[13] [13]

A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, G. Huang, Expel: Llm agents are experiential learners, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 19632–19642

2024

[14] [14]

Z.Z.Wang,J.Mao,D.Fried,G.Neubig,Agentworkflowmemory,arXivpreprint arXiv:2409.07429 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Y. Liu, C. Si, K. R. Narasimhan, S. Yao, Contextual experience replay for self- improvement of language agents, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 14179–14198

2025

[16] [16]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srini- vasa, G. Liu, G. Neubig, et al., Skillweaver: Web agents can self-improve by discovering and honing skills, 2025, URL https://arxiv.org/abs/2504.07079. 32

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Y. Zhou, Q. Yang, K. Lin, M. Bai, X. Zhou, Y.-X. Wang, S. Levine, L. E. Li, Proposer-agent-evaluator(pae): Autonomousskilldiscoveryforfoundationmodel internet agents, in: Forty-second International Conference on Machine Learning, 2025

2025

[18] [18]

Prabhu, Y

V. Prabhu, Y. Dai, M. Fernandez, J. Gu, K. Ramakrishnan, Y. Luo, S. Savarese, C. Xiong, J. Li, Z. Chen, et al., Walt: Web agents that learn tools, arXiv preprint arXiv:2510.01524 (2025)

work page arXiv 2025

[19] [19]

Zhong, F

H. Zhong, F. Faisal, L. França, T. Leesatapornwongsa, A. Szekeres, K. Rong, S. Nath, Actionengine: From reactive to programmatic gui agents via state ma- chine memory, arXiv preprint arXiv:2602.20502 (2026)

work page arXiv 2026

[20] [20]

W.Liu,X.-J.Wu,F.Zhu,M.-M.Yu,C.Wang,C.-L.Liu,Classincrementallearn- ing with self-supervised pre-training and prototype learning, Pattern Recognition 157 (2025) 110943

2025

[21] [21]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al., Webarena: A realistic web environment for building autonomous agents, arXiv preprint arXiv:2307.13854 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Z.Wang,Y.Sun,X.Zhang,B.Xu,Z.Yang,H.Lin,Continuallearningwithhigh- orderexperiencereplayfordynamicnetworkembedding,PatternRecognition159 (2025) 111093

2025

[23] [23]

G. Liu, S. Geng, S. Li, H. Cui, S. Zhang, X. Liu, T. Liu, Webcoach: Self-evolving web agents with cross-session memory guidance, arXiv preprint arXiv:2511.12997 (2025)

work page arXiv 2025

[24] [24]

ExpSeek: Self-Triggered Experience Seeking for Web Agents

W. Zhang, X. Zhang, H. Yu, S. Nie, B. Wu, J. Yue, T. Liu, Y. Li, Expseek: Self- triggered experience seeking for web agents, arXiv preprint arXiv:2601.08605 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

CASCADE : Cumulative agentic skill creation through autonomous development and evolution

X. Huang, J. Chen, Y. Fei, Z. Li, P. Schwaller, G. Ceder, Cascade: Cumula- tive agentic skill creation through autonomous development and evolution, arXiv preprint arXiv:2512.23880 (2025). 33

work page arXiv 2025

[26] [26]

J.Qiu,X.Qi,T.Zhang,X.Juan,J.Guo,Y.Lu,Y.Wang,Z.Yao,Q.Ren,X.Jiang, et al., Alita: Generalist agent enabling scalable agentic reasoning with mini- mal predefinition and maximal self-evolution, arXiv preprint arXiv:2505.20286 (2025)

work page arXiv 2025

[27] [27]

Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, W. Zhao, Y. Yang, X. Yang, J. Sun, S.Yao,T.Zhang,W.Xu,J.Tang,Y.Dong,Webrl: Trainingllmwebagentsviaself- evolving online curriculum reinforcement learning (2025).arXiv:2411.02337

work page arXiv 2025

[28] [28]

7920–7939

Z.Wei, W.Yao, Y.Liu, W.Zhang, Q.Lu, L.Qiu, C.Yu, P.Xu, C.Zhang, B.Yin, etal.,Webagent-r1: Trainingwebagentsviaend-to-endmulti-turnreinforcement learning,in: Proceedingsofthe2025ConferenceonEmpiricalMethodsinNatural Language Processing, 2025, pp. 7920–7939

2025

[29] [29]

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

P. Putta, E. Mills, N. Garg, S. Motwani, C. Finn, D. Garg, R. Rafailov, Agent q: Advanced reasoning and learning for autonomous ai agents, arXiv preprint arXiv:2408.07199 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

K. Yang, Y. Liu, S. Chaudhary, R. Fakoor, P. Chaudhari, G. Karypis, H. Rang- wala, Agentoccam: A simple yet strong baseline for llm-based web agents, arXiv preprint arXiv:2410.13825 (2024)

work page arXiv 2024

[31] [31]

Gandhi, G

A. Gandhi, G. Neubig, Go-browse: Training web agents with structured explo- ration, arXiv preprint arXiv:2506.03533 (2025)

work page arXiv 2025

[32] [32]

Sodhi, S

P. Sodhi, S. Branavan, Y. Artzi, R. McDonald, Step: Stacked llm policies for web actions, arXiv preprint arXiv:2310.03720 (2023)

work page arXiv 2023

[33] [33]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H.Wei,etal.,Qwen2.5technicalreport,arXivpreprintarXiv:2412.15115(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

J. Dong, Y. Cong, G. Sun, T. Zhang, Lifelong robotic visual-tactile perception learning, Pattern Recognition 121 (2022) 108176

2022

[35] [35]

J. Yin, X. Zhang, L. Wu, X. Wang, Context-aware prompt learning for test-time vision recognition with frozen vision-language model, Pattern Recognition 162 (2025) 111359. 34

2025

[36] [36]

J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, D. Fried, Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 881–905

2024

[37] [37]

Y. Qian, K. Qian, X. He, L. Chen, J. Zhang, T. Zhang, H. Wei, L. Wang, H. Wu, B. Mao, Zero-permission manipulation: Can we trust large multimodal model powered gui agents?, arXiv preprint arXiv:2601.12349 (2026). 35

work page arXiv 2026