pith · machine review for the scientific record

arXiv: 2604.09571 · v1 · submitted 2026-02-20 · 💻 cs.HC · cs.AI

Recognition: no theorem link

Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:47 UTC · model grok-4.3

classification: 💻 cs.HC · cs.AI
keywords: vision-language models · web interaction · fine-tuning · mouse control · Qwen2.5-VL · agentic tasks · visual input

The pith

Fine-tuning a vision-language model raises its success rate on web clicking tasks from 86% to 94%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates using vision-language models as independent agents for web tasks based solely on visual input. It identifies three challenges in Qwen2.5-VL-32B: poor localization of elements and cursor, sensitivity to how instructions are worded, and a tendency to assume its actions succeed without checking. To fix these, the authors apply a two-stage fine-tuning process focused on basic mouse movement and clicking. This approach improves performance on a custom benchmark of single-click tasks. If correct, it shows a practical way to make VLMs more dependable for automating simple web interactions.
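The headline numbers are success rates over the paper's custom benchmark of single-click tasks. The paper does not publish its evaluation harness, so the following is only a rough sketch of how such a benchmark is typically scored; the names and the click-inside-the-box success criterion are assumptions, not the authors' code.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class ClickTask:
        screenshot: bytes                        # rendered page image
        instruction: str                         # e.g. "click the blue 'Submit' button"
        target_box: Tuple[int, int, int, int]    # (x0, y0, x1, y1) of the correct element

    def success_rate(agent: Callable[[bytes, str], Tuple[int, int]],
                     tasks: List[ClickTask]) -> float:
        """Fraction of tasks whose final click lands inside the target element's box."""
        hits = 0
        for task in tasks:
            x, y = agent(task.screenshot, task.instruction)  # agent returns click coordinates
            x0, y0, x1, y1 = task.target_box
            if x0 <= x <= x1 and y0 <= y <= y1:
                hits += 1
        return hits / len(tasks)

A click-inside-the-box criterion is only one plausible definition of success; the paper's own criterion may differ.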

Core claim

The paper applies a two-stage fine-tuning pipeline to Qwen2.5-VL-32B: the model is first taught to decide whether the cursor must move to a target element described in natural language, and then to issue and verify single mouse commands one at a time. This raises the success rate on challenging single-click web tasks from 86% to 94%, equivalently cutting the failure rate from 14% to 6%, and directly targets the model's inaccurate localization, sensitivity to instruction phrasing, and overoptimistic bias toward its own actions.

What carries the argument

A two-stage fine-tuning pipeline that trains the model to first assess whether the cursor hovers over the target and then execute one mouse action at a time while verifying the outcome.
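Read as a control loop, that pipeline amounts to something like the sketch below. This is an illustrative rendering rather than the authors' implementation: the model-query helpers and the step budget are assumptions.

    MAX_STEPS = 8  # assumed step budget, not stated in the paper

    def run_single_click_task(vlm, env, target_description: str) -> bool:
        """One task: keep issuing single mouse commands until the target is clicked."""
        for _ in range(MAX_STEPS):
            screenshot = env.screenshot()

            # Stage 1 behaviour: decide whether the cursor already hovers over the
            # target element, or whether a move is still required.
            hovering = vlm.is_hovering(screenshot, target_description)   # hypothetical helper

            # Stage 2 behaviour: emit exactly one command (move or click), never a plan.
            if hovering:
                env.click()
            else:
                x, y = vlm.propose_move(screenshot, target_description)  # hypothetical helper
                env.move_mouse(x, y)

            # Verification: the next iteration re-reads the screen instead of assuming
            # the previous action succeeded, which is the bias the tuning targets.
            if env.task_done():
                return True
        return False

The load-bearing design point is the single-command granularity: the model never assumes a multi-step plan succeeded; every action is checked against a fresh screenshot before the next decision.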

If this is right

  • The model becomes better at determining cursor position relative to targets from images alone.
  • It learns to issue single commands rather than assuming multi-step success.
  • Overall reliability in visual web control increases substantially.
  • Open-source VLMs can handle basic web agent tasks more effectively after targeted tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The technique of forcing step-by-step verification could apply to other agentic tasks involving visual feedback.
  • Extending this to multi-action sequences might enable more complex web automation without additional human oversight.
  • Similar training could reduce instruction sensitivity in other vision-language models.

Load-bearing premise

The custom benchmark of single-click web tasks is representative of real-world web interaction challenges and the observed improvements will generalize beyond the tested model and task distribution.

What would settle it

Evaluating the fine-tuned model against the untuned baseline on a diverse set of unseen real-world websites, or on more complex multi-click tasks; the claim would be undermined if the tuned model's advantage fails to hold in those settings.

Figures

Figures reproduced from arXiv: 2604.09571 by Alexander Ilin, Alexandra Yakovleva, Harri Valpola, Henrik Pärssinen, Juho Kannala.

Figure 1. Overview of the proposed two-stage fine-tuning pipeline.
Original abstract

Recent advances in vision-language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. We investigate this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, and focus on improving its reliability in web-based control. Through initial experimentation, we observe three key challenges: (i) inaccurate localization of target elements, the cursor, and their relative positions, (ii) sensitivity to instruction phrasing, and (iii) an overoptimistic bias toward its own actions, often assuming they succeed rather than analyzing their actual outcomes. To address these issues, we fine-tune Qwen2.5-VL-32B for a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training pipeline consists of two stages: (1) teaching the model to determine whether the cursor already hovers over the target element or whether movement is required, and (2) training it to execute a single command (a mouse move or a mouse click) at a time, verifying the resulting state of the environment before planning the next action. Evaluated on a custom benchmark of single-click web tasks, our approach increases success rates from 86% to 94% under the most challenging setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that fine-tuning Qwen2.5-VL-32B via a two-stage pipeline—first teaching cursor hover detection and then single-command execution with state verification—addresses localization errors, phrasing sensitivity, and action bias, yielding an 8pp success-rate gain (86% to 94%) on a custom benchmark of single-click web tasks.

Significance. If the performance delta is reproducible and generalizes, the work supplies a concrete, failure-mode-targeted fine-tuning recipe that could improve the reliability of open VLMs as visual web agents. The staged training design is a clear methodological strength that directly targets the three challenges identified in the abstract.

major comments (1)
  1. [Evaluation] Evaluation section: The headline result (86% → 94% success under the most challenging setting) is load-bearing for the central claim, yet the manuscript supplies no task count, website diversity statistics, run-to-run variance, statistical significance tests, or ablation against other fine-tuning regimes. Without these quantities the observed delta cannot be distinguished from benchmark-specific artifacts.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'most challenging setting' is undefined; a brief parenthetical (e.g., 'zero-shot websites' or 'long-horizon pages') would clarify scope.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation section. We agree that additional quantitative details are required to support the reported performance gains and will revise the manuscript to incorporate them.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline result (86% → 94% success under the most challenging setting) is load-bearing for the central claim, yet the manuscript supplies no task count, website diversity statistics, run-to-run variance, statistical significance tests, or ablation against other fine-tuning regimes. Without these quantities the observed delta cannot be distinguished from benchmark-specific artifacts.

    Authors: We agree that these details are necessary for a rigorous evaluation. In the revised manuscript we will report the exact number of tasks and websites in the benchmark, along with diversity statistics (e.g., distribution across domains and page complexities). We will also add results from multiple independent runs with different random seeds to quantify run-to-run variance, include statistical significance tests (such as paired t-tests) comparing the baseline and fine-tuned models, and provide ablations against single-stage fine-tuning and other relevant regimes. These changes will clarify that the 8pp improvement is reproducible and not benchmark-specific. revision: yes
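One way to make the promised significance test concrete: with per-task paired outcomes, an exact McNemar test on the discordant tasks (solved by one model but not the other) is the standard check for a paired binary comparison. The sketch below is illustrative only; the 200-task benchmark size and the 4/20 discordant split are assumptions for the example, since the paper reports neither.

    from math import comb

    def mcnemar_exact_p(b: int, c: int) -> float:
        """Two-sided exact McNemar test on discordant pairs.

        b: tasks the baseline solved but the tuned model failed
        c: tasks the tuned model solved but the baseline failed
        Under the null hypothesis of no real difference, each discordant
        task falls on either side with probability 1/2.
        """
        n, k = b + c, min(b, c)
        p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * p_tail)

    # Hypothetical example: 200 tasks, baseline 86% (172 solved), tuned 94% (188 solved),
    # with the net 16-task gain decomposing into 4 regressions and 20 new successes.
    print(mcnemar_exact_p(b=4, c=20))  # ~0.0015 under these assumed counts

With a much smaller benchmark, or a different regression/gain split, the same 8pp delta can fail to reach significance, which is exactly why the referee asks for the task count and variance to be reported.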

Circularity Check

0 steps flagged

Empirical fine-tuning results contain no circular derivation chain

full rationale

The paper describes a two-stage fine-tuning procedure on Qwen2.5-VL-32B for single-click web tasks followed by direct evaluation on a custom benchmark, reporting measured success-rate improvement (86% to 94%). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim is an observed empirical delta on held-out tasks, not a quantity forced by construction from the training data or prior self-references. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard supervised fine-tuning assumptions and the validity of the unreported custom benchmark.

pith-pipeline@v0.9.0 · 5567 in / 1091 out tokens · 26804 ms · 2026-05-15T20:47:35.716931+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Minttu Alakuijala, Ya Gao, Georgy Ananov, Samuel Kaski, Pekka Marttinen, Alexander Ilin, and Harri Valpola. 2025. Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization. arXiv:2502.01562 [cs.LG] https://arxiv.org/abs/2502.01562

  2. [2]

    Tanvir Bhathal and Asanshay Gupta. 2025. Websight: A vision-first architecture for robust web agents. arXiv:2508.16987 [cs.AI] https://arxiv.org/abs/2508.16987

  3. [3]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Bangkok, Thailand, 9313–9332. https://aclanthology.org/2...

  4. [4]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070 [cs.CL] https://arxiv.org/abs/2306.06070

  5. [5]

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv:2401.13919 [cs.CL] https://arxiv.org/abs/2401.13919

  6. [6]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  7. [7]

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried

  8. [8]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649 [cs.CL] https://arxiv.org/abs/2401.13649

  9. [9]

    Kalle Kujanpää, Pekka Marttinen, Harri Valpola, and Alexander Ilin. 2025. Efficient Knowledge Injection in LLMs via Self-Distillation. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=drYpdSnRJk

  10. [10]

    Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, and Qi Wang. 2024. ScreenAgent: A Vision Language Model-driven Computer Control Agent. arXiv:2402.07945 [cs.HC] https://arxiv.org/abs/2402.07945

  11. [11]

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. 2025. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv:2501.12326 [cs.AI] https://arxiv.org/abs/2501.12326

  12. [12]

    Charlie Snell, Dan Klein, and Ruiqi Zhong. 2022. Learning by Distilling Context. arXiv:2209.15189 [cs.CL] https://arxiv.org/abs/2209.15189

  13. [13]

    Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/

  14. [14]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations.

  15. [15]

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a Generalist Web Agent, if Grounded. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=piecKJ2DlB

  16. [16]

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. In Proceedings of the International Conference on Learning Representations. https://webarena.dev