
arxiv: 2605.23939 · v1 · pith:BCGEEOVCnew · submitted 2026-04-28 · 💻 cs.AI · cs.LG

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Pith reviewed 2026-07-01 08:59 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords web agents · skill modeling · continual learning · reasoning skills · interaction skills · task decomposition · web navigation · dual-level framework

The pith

Web agents improve by separating transferable reasoning skills in natural language from page-specific interaction skills in code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web agents must handle both abstract task logic that works across sites and concrete page manipulations that do not. Storing both kinds of experience in one form creates a tradeoff: abstract versions lose executability while concrete versions fail to generalize. DRIVE splits historical experience into natural language reasoning skills for task decomposition and programmatic interaction skills for element manipulation. A scene-aware mechanism retrieves and combines the right skills for each new page, while skill-level reflection updates the libraries when failures occur. Experiments on five WebArena domains report an average task success rate of 52.8%, which is 7.3 percentage points above the skill-free baseline.

Core claim

DRIVE models skills at two levels. It extracts natural language reasoning skills that capture transferable task logic (for example, searching routes before booking) and programmatic interaction skills that ground those actions in specific page operations. A scene-aware retrieval process then selects skills based on task semantics and the current page context, so agents can accumulate capabilities across domains without entangling abstract and concrete knowledge.

What carries the argument

Dual-level skill modeling framework that separates natural language reasoning skills from programmatic interaction skills, coordinated by a scene-aware mechanism and refined through skill-level reflection.
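
To make the two-library idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the `ReasoningSkill` and `InteractionSkill` schemas, the keyword sets, the action strings, and the Jaccard-overlap scoring are all hypothetical stand-ins for whatever representation and relevance model DRIVE actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReasoningSkill:
    """Transferable task logic kept as natural language (hypothetical schema)."""
    name: str
    guidance: str
    task_keywords: set[str]


@dataclass
class InteractionSkill:
    """Page-grounded routine kept as executable code (hypothetical schema)."""
    name: str
    run: Callable[[dict], list[str]]  # arguments -> low-level action strings
    task_keywords: set[str]
    page_keywords: set[str]


def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap; a toy stand-in for a real relevance model."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(task: str, page: str, reasoning_lib, interaction_lib):
    """Select a reasoning skill from the task, and an interaction skill from task plus page."""
    t, p = tokens(task), tokens(page)
    reasoning = max(reasoning_lib, key=lambda s: overlap(t, s.task_keywords))
    interaction = max(
        interaction_lib,
        key=lambda s: overlap(t, s.task_keywords) + overlap(p, s.page_keywords),
    )
    return reasoning, interaction


def submit_search(args: dict) -> list[str]:
    # Action strings are illustrative, not a real browser API.
    return [f"type [search box] {args['query']}", "click [search button]"]


def fill_address_form(args: dict) -> list[str]:
    return [f"select [state] {args['state']}", f"type [zip] {args['zip']}"]


reasoning_lib = [
    ReasoningSkill("search_before_booking",
                   "Search for routes before attempting to book.",
                   {"book", "flight", "route"}),
    ReasoningSkill("stop_when_results_answer",
                   "Stop once the results page already answers the query.",
                   {"search", "find", "list"}),
]
interaction_lib = [
    InteractionSkill("fill_address_form", fill_address_form,
                     {"address", "shipping", "update"}, {"state", "zip", "form"}),
    InteractionSkill("submit_search", submit_search,
                     {"search", "find"}, {"search", "box", "button"}),
]

reasoning, interaction = retrieve(
    "search for switch accessories",
    "home page with search box and search button",
    reasoning_lib,
    interaction_lib,
)
print(reasoning.name, interaction.name)
print(interaction.run({"query": "switch accessories"}))
```

The point of the sketch is the asymmetry: the reasoning skill is chosen from task semantics alone, while the interaction skill also depends on what the current page offers.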

If this is right

  • Reasoning skills can be reused across sites with different layouts because they remain in natural language.
  • Interaction skills can be updated independently when page structures change without rewriting task logic.
  • Skill-level reflection isolates whether failures stem from missing reasoning or missing interaction knowledge.
  • Agents accumulate capabilities over time by expanding the two libraries separately rather than retraining end-to-end.
  • Ablations confirm the two skill types deliver distinct complementary gains rather than redundant ones.
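
The independent-update and failure-isolation points above can be sketched as a small routing step. This is a hypothetical illustration: `SkillLibraries`, `Failure`, and `apply_reflection` are invented names, and the failure level is assumed to come from a separate attribution step that is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibraries:
    reasoning: dict[str, str] = field(default_factory=dict)    # name -> natural-language guidance
    interaction: dict[str, str] = field(default_factory=dict)  # name -> source of a page routine


@dataclass
class Failure:
    level: str         # "reasoning" or "interaction"; assumed output of an attribution step
    skill_name: str
    revised_body: str  # proposed new or corrected skill content


def apply_reflection(libs: SkillLibraries, failure: Failure) -> str:
    """Update only the library at the level the failure was attributed to."""
    if failure.level not in ("reasoning", "interaction"):
        raise ValueError(f"unknown failure level: {failure.level}")
    target = libs.reasoning if failure.level == "reasoning" else libs.interaction
    action = "refined" if failure.skill_name in target else "added"
    target[failure.skill_name] = failure.revised_body
    return action


libs = SkillLibraries(
    reasoning={"search_before_booking": "Search for routes before attempting to book."}
)
before = dict(libs.reasoning)

# An interaction-level failure touches the interaction library only.
outcome = apply_reflection(
    libs,
    Failure("interaction", "fill_address_form", "select state, then type zip code"),
)
print(outcome, libs.reasoning == before)
```

Because each failure is routed to exactly one library, a page redesign changes interaction entries without rewriting any task logic.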

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation might reduce interference when an agent must switch between multiple websites in one session.
  • It could be extended to other agent settings where high-level plans must map to low-level actuators that change over time.
  • Library growth might eventually require explicit conflict detection between reasoning skills acquired from different task families.

Load-bearing premise

Historical experience can be cleanly separated into natural language reasoning skills and programmatic interaction skills, and a scene-aware mechanism can retrieve and coordinate them reliably on unseen websites without new failure modes.

What would settle it

Measure task success rate on a held-out set of websites that introduce interaction patterns absent from training data; if the separated model performs no better than the uniform baseline, the separation provides no net gain.
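
One way such a held-out comparison could be scored is sketched below. It assumes both agents are run on the same held-out tasks with one binary outcome per task, and it uses an exact sign test on the discordant pairs (the exact form of McNemar's test), which is a choice made for this sketch rather than something the paper specifies. The outcome lists are placeholder data for illustration, not results.

```python
from math import comb


def paired_comparison(separated: list[bool], uniform: list[bool]) -> dict:
    """Compare two agents on the same held-out tasks (one boolean outcome per task)."""
    if len(separated) != len(uniform) or not separated:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(separated)
    # Discordant pairs: tasks that exactly one of the two agents solved.
    only_sep = sum(s and not u for s, u in zip(separated, uniform))
    only_uni = sum(u and not s for s, u in zip(separated, uniform))
    discordant = only_sep + only_uni
    if discordant == 0:
        p_value = 1.0
    else:
        # Exact two-sided sign test on the discordant pairs.
        tail = sum(comb(discordant, k) for k in range(min(only_sep, only_uni) + 1))
        p_value = min(1.0, 2 * tail / 2 ** discordant)
    return {
        "separated_rate": sum(separated) / n,
        "uniform_rate": sum(uniform) / n,
        "gain_pp": 100 * (sum(separated) - sum(uniform)) / n,
        "only_separated": only_sep,
        "only_uniform": only_uni,
        "p_value": p_value,
    }


# Placeholder outcomes on eight hypothetical held-out tasks.
separated = [True, True, False, True, False, True, True, False]
uniform = [True, False, False, True, False, False, True, False]
result = paired_comparison(separated, uniform)
print(result)
```

A gain near zero, or a large p-value on a reasonably sized held-out set, would match the "no net gain" outcome described above.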

Figures

Figures reproduced from arXiv: 2605.23939 by Hao Chen, Haoyuan Chen, Jian Huang, Maolin He, Rong Zhou, Sihang Zhou, Siwei Wang, Xirui Liu, Yanning Hou.

Figure 1: Overview of DRIVE. DRIVE consists of an offline stage for skill abstraction and evolution and …
Figure 2: Continuous capability accumulation in DRIVE. As the number of training trajectories increases …
Figure 3: Validation of failure attribution. (a) Confusion matrix between human-adjudicated and LM-based …
Figure 4: Case study resolving an interaction error. The baseline agent fails due to brittle execution on dynamic interface elements (e.g., state and zip-code field dependencies). In contrast, DRIVE invokes a programmatic interaction skill that encodes a robust procedure for structured form completion, successfully bypassing the page-specific constraints.
Figure 5: Case study resolving a reasoning error. The baseline agent exhibits flawed stopping-condition judgment, unnecessarily clicking into a product page when the search results already satisfy the query. DRIVE leverages a natural-language reasoning skill to correctly frame the task intent and halt execution at the appropriate state.

Original abstract

Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes DRIVE, a dual-level skill modeling framework for web agents under continual learning. It separates historical experience into natural language reasoning skills (abstract, transferable task logic such as flight booking decomposition) and programmatic interaction skills (page-grounded executable operations). A scene-aware coordination mechanism retrieves and invokes these skills, supplemented by skill-level reflection for targeted library updates. On five WebArena domains, DRIVE reports 52.8% average task success rate, exceeding the skill-free baseline by 7.3 percentage points, with ablations indicating that reasoning and interaction skills yield distinct, complementary benefits.

Significance. If the separation and coordination claims hold under proper controls, the work would advance continual learning for web agents by addressing the entanglement of abstract and concrete knowledge, enabling better cross-domain accumulation. The concrete performance number, baseline comparison, and ablation outcomes on complementary benefits constitute a falsifiable empirical contribution that could be built upon if extended to out-of-domain validation.

major comments (2)
  1. [Abstract] Abstract (experiments paragraph): the central claim that reasoning and interaction skills can be cleanly separated without introducing new failure modes on unseen sites is load-bearing, yet all reported results and ablations operate inside the same five WebArena domains used for skill extraction; this does not test transfer to sites with novel page structures where scene descriptors or retrieval might mismatch.
  2. [Abstract] Abstract: the reported 52.8% success rate and 7.3 pp improvement over baseline are presented without any reference to the number of tasks, number of trials per task, statistical significance tests, error bars, or exact computation method, which prevents verification of the empirical claim's reliability.
minor comments (1)
  1. [Abstract] The abstract could more explicitly reference the full methods and experimental sections to allow readers to locate details on the scene-aware retriever and reflection mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will incorporate revisions to improve clarity and transparency in the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract (experiments paragraph): the central claim that reasoning and interaction skills can be cleanly separated without introducing new failure modes on unseen sites is load-bearing, yet all reported results and ablations operate inside the same five WebArena domains used for skill extraction; this does not test transfer to sites with novel page structures where scene descriptors or retrieval might mismatch.

    Authors: We agree that the current evaluation is limited to the five WebArena domains used for skill extraction and does not include explicit tests on entirely novel sites with unseen page structures. The ablations demonstrate complementary benefits within these domains, but the load-bearing claim about clean separation without new failure modes on unseen sites is not directly supported by out-of-domain results. We will revise the abstract to clarify the evaluation scope, tone down the generalization claim, and add a limitations paragraph discussing potential mismatches in scene descriptors or retrieval for novel sites, along with plans for future out-of-domain validation. revision: yes

  2. Referee: [Abstract] Abstract: the reported 52.8% success rate and 7.3 pp improvement over baseline are presented without any reference to the number of tasks, number of trials per task, statistical significance tests, error bars, or exact computation method, which prevents verification of the empirical claim's reliability.

    Authors: We agree that the abstract lacks sufficient experimental details for verification. The full paper (Section 4) specifies evaluation on 250 tasks (50 per domain) with 3 independent trials per task, using the standard WebArena binary success metric averaged across runs. We will update the abstract to include these details, along with error bars and reference to statistical testing where applicable in the results. revision: yes
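
The evaluation protocol described in this response (binary success per trial, several trials per task, averaged across runs) can be sketched as follows. The nested-dictionary layout, the function name `summarize`, and the use of a standard error across per-task rates are assumptions made for illustration; the toy outcomes are placeholders, not the paper's data.

```python
from math import sqrt
from statistics import stdev


def summarize(outcomes: dict[str, list[list[int]]]) -> dict:
    """outcomes[domain] is a list of tasks; each task is a list of 0/1 trial results."""
    per_domain = {}
    task_rates = []
    for domain, tasks in outcomes.items():
        rates = [sum(trials) / len(trials) for trials in tasks]  # per-task success rate
        per_domain[domain] = sum(rates) / len(rates)
        task_rates.extend(rates)
    overall = sum(task_rates) / len(task_rates)
    # Standard error across tasks; needs at least two tasks in total.
    std_err = stdev(task_rates) / sqrt(len(task_rates))
    return {
        "per_domain": per_domain,
        "overall": overall,
        "std_err": std_err,
        "n_tasks": len(task_rates),
    }


# Toy data: two domains, two tasks each, three trials per task.
toy = {
    "shopping": [[1, 1, 1], [0, 0, 0]],
    "maps": [[1, 1, 0], [1, 0, 0]],
}
summary = summarize(toy)
print(summary)
```

With the same number of tasks in every domain, averaging over tasks and averaging over domains give the same overall rate, so reporting which one was used matters mainly when domain sizes differ.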

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

Full rationale

The paper proposes a dual-level skill separation framework and reports empirical success rates (52.8% vs. baseline) plus ablations on WebArena domains. No equations, fitted parameters, or derivations are present. The separation of reasoning vs. interaction skills is an architectural assumption tested via experiments rather than derived from its own outputs. No self-citation chains, uniqueness theorems, or renamings reduce the central claim to its inputs by construction. The result is self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that reasoning and interaction knowledge are separable without loss; no free parameters, invented physical entities, or additional axioms are visible in the abstract.

axioms (1)
  • Domain assumption: reasoning knowledge is abstract and transferable, while interaction knowledge is page-specific and non-transferable.
    This premise is stated directly in the opening motivation of the abstract and underpins the entire dual-level design.


