pith · machine review for the scientific record

arXiv: 2604.09571 · v1 · submitted 2026-02-20 · 💻 cs.HC · cs.AI

Recognition: no theorem link

Tuning Qwen2.5-VL to Improve Its Web Interaction Skills

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:47 UTC · model grok-4.3

classification: 💻 cs.HC · cs.AI
keywords: vision-language models · web interaction · fine-tuning · mouse control · Qwen2.5-VL · agentic tasks · visual input

The pith

Fine-tuning a vision-language model raises its success rate on web clicking tasks from 86% to 94%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates using vision-language models as independent agents for web tasks based solely on visual input. It identifies three challenges in Qwen2.5-VL-32B: poor localization of elements and cursor, sensitivity to how instructions are worded, and a tendency to assume its actions succeed without checking. To fix these, the authors apply a two-stage fine-tuning process focused on basic mouse movement and clicking. This approach improves performance on a custom benchmark of single-click tasks. If correct, it shows a practical way to make VLMs more dependable for automating simple web interactions.
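The headline numbers are success rates over the paper's custom benchmark of single-click tasks. The paper does not publish its evaluation harness, so the following is only a rough sketch of how such a benchmark is typically scored; the names and the click-inside-the-box success criterion are assumptions, not the authors' code.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class ClickTask:
        screenshot: bytes                        # rendered page image
        instruction: str                         # e.g. "click the blue 'Submit' button"
        target_box: Tuple[int, int, int, int]    # (x0, y0, x1, y1) of the correct element

    def success_rate(agent: Callable[[bytes, str], Tuple[int, int]],
                     tasks: List[ClickTask]) -> float:
        """Fraction of tasks whose final click lands inside the target element's box."""
        hits = 0
        for task in tasks:
            x, y = agent(task.screenshot, task.instruction)  # agent returns click coordinates
            x0, y0, x1, y1 = task.target_box
            if x0 <= x <= x1 and y0 <= y <= y1:
                hits += 1
        return hits / len(tasks)

A click-inside-the-box criterion is only one plausible definition of success; the paper's own criterion may differ.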

Core claim

The paper applies a two-stage fine-tuning pipeline to Qwen2.5-VL-32B: the model is first taught to decide whether the cursor must move to a target element described in natural language, and then to issue and verify single mouse commands one at a time. This raises the success rate on challenging single-click web tasks from 86% to 94%, equivalently cutting the failure rate from 14% to 6%, and directly targets the model's inaccurate localization, sensitivity to instruction phrasing, and overoptimistic bias toward its own actions.

What carries the argument

A two-stage fine-tuning pipeline that trains the model to first assess whether the cursor hovers over the target and then execute one mouse action at a time while verifying the outcome.
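Read as a control loop, that pipeline amounts to something like the sketch below. This is an illustrative rendering rather than the authors' implementation: the model-query helpers and the step budget are assumptions.

    MAX_STEPS = 8  # assumed step budget, not stated in the paper

    def run_single_click_task(vlm, env, target_description: str) -> bool:
        """One task: keep issuing single mouse commands until the target is clicked."""
        for _ in range(MAX_STEPS):
            screenshot = env.screenshot()

            # Stage 1 behaviour: decide whether the cursor already hovers over the
            # target element, or whether a move is still required.
            hovering = vlm.is_hovering(screenshot, target_description)   # hypothetical helper

            # Stage 2 behaviour: emit exactly one command (move or click), never a plan.
            if hovering:
                env.click()
            else:
                x, y = vlm.propose_move(screenshot, target_description)  # hypothetical helper
                env.move_mouse(x, y)

            # Verification: the next iteration re-reads the screen instead of assuming
            # the previous action succeeded, which is the bias the tuning targets.
            if env.task_done():
                return True
        return False

The load-bearing design point is the single-command granularity: the model never assumes a multi-step plan succeeded; every action is checked against a fresh screenshot before the next decision.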

If this is right

  • The model becomes better at determining cursor position relative to targets from images alone.
  • It learns to issue single commands rather than assuming multi-step success.
  • Overall reliability in visual web control increases substantially.
  • Open-source VLMs can handle basic web agent tasks more effectively after targeted tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The technique of forcing step-by-step verification could apply to other agentic tasks involving visual feedback.
  • Extending this to multi-action sequences might enable more complex web automation without additional human oversight.
  • Similar training could reduce instruction sensitivity in other vision-language models.

Load-bearing premise

The custom benchmark of single-click web tasks is representative of real-world web interaction challenges and the observed improvements will generalize beyond the tested model and task distribution.

What would settle it

Evaluating the fine-tuned model against the untuned baseline on a diverse set of unseen real-world websites, or on more complex multi-click tasks; the claim would be undermined if the tuned model's advantage fails to hold in those settings.

Figures

Figures reproduced from arXiv: 2604.09571 by Alexander Ilin, Alexandra Yakovleva, Harri Valpola, Henrik Pärssinen, Juho Kannala.

Figure 1. Overview of the proposed two-stage fine-tuning pipeline.
Original abstract

Recent advances in vision-language models (VLMs) have sparked growing interest in using them to automate web tasks, yet their feasibility as independent agents that reason and act purely from visual input remains underexplored. We investigate this setting using Qwen2.5-VL-32B, one of the strongest open-source VLMs available, and focus on improving its reliability in web-based control. Through initial experimentation, we observe three key challenges: (i) inaccurate localization of target elements, the cursor, and their relative positions, (ii) sensitivity to instruction phrasing, and (iii) an overoptimistic bias toward its own actions, often assuming they succeed rather than analyzing their actual outcomes. To address these issues, we fine-tune Qwen2.5-VL-32B for a basic web interaction task: moving the mouse and clicking on a page element described in natural language. Our training pipeline consists of two stages: (1) teaching the model to determine whether the cursor already hovers over the target element or whether movement is required, and (2) training it to execute a single command (a mouse move or a mouse click) at a time, verifying the resulting state of the environment before planning the next action. Evaluated on a custom benchmark of single-click web tasks, our approach increases success rates from 86% to 94% under the most challenging setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that fine-tuning Qwen2.5-VL-32B via a two-stage pipeline—first teaching cursor hover detection and then single-command execution with state verification—addresses localization errors, phrasing sensitivity, and action bias, yielding an 8pp success-rate gain (86% to 94%) on a custom benchmark of single-click web tasks.

Significance. If the performance delta is reproducible and generalizes, the work supplies a concrete, failure-mode-targeted fine-tuning recipe that could improve the reliability of open VLMs as visual web agents. The staged training design is a clear methodological strength that directly targets the three challenges identified in the abstract.

major comments (1)
  1. [Evaluation] Evaluation section: The headline result (86% → 94% success under the most challenging setting) is load-bearing for the central claim, yet the manuscript supplies no task count, website diversity statistics, run-to-run variance, statistical significance tests, or ablation against other fine-tuning regimes. Without these quantities the observed delta cannot be distinguished from benchmark-specific artifacts.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'most challenging setting' is undefined; a brief parenthetical (e.g., 'zero-shot websites' or 'long-horizon pages') would clarify scope.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation section. We agree that additional quantitative details are required to support the reported performance gains and will revise the manuscript to incorporate them.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline result (86% → 94% success under the most challenging setting) is load-bearing for the central claim, yet the manuscript supplies no task count, website diversity statistics, run-to-run variance, statistical significance tests, or ablation against other fine-tuning regimes. Without these quantities the observed delta cannot be distinguished from benchmark-specific artifacts.

    Authors: We agree that these details are necessary for a rigorous evaluation. In the revised manuscript we will report the exact number of tasks and websites in the benchmark, along with diversity statistics (e.g., distribution across domains and page complexities). We will also add results from multiple independent runs with different random seeds to quantify run-to-run variance, include statistical significance tests (such as paired t-tests) comparing the baseline and fine-tuned models, and provide ablations against single-stage fine-tuning and other relevant regimes. These changes will clarify that the 8pp improvement is reproducible and not benchmark-specific. revision: yes
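One way to make the promised significance test concrete: with per-task paired outcomes, an exact McNemar test on the discordant tasks (solved by one model but not the other) is the standard check for a paired binary comparison. The sketch below is illustrative only; the 200-task benchmark size and the 4/20 discordant split are assumptions for the example, since the paper reports neither.

    from math import comb

    def mcnemar_exact_p(b: int, c: int) -> float:
        """Two-sided exact McNemar test on discordant pairs.

        b: tasks the baseline solved but the tuned model failed
        c: tasks the tuned model solved but the baseline failed
        Under the null hypothesis of no real difference, each discordant
        task falls on either side with probability 1/2.
        """
        n, k = b + c, min(b, c)
        p_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * p_tail)

    # Hypothetical example: 200 tasks, baseline 86% (172 solved), tuned 94% (188 solved),
    # with the net 16-task gain decomposing into 4 regressions and 20 new successes.
    print(mcnemar_exact_p(b=4, c=20))  # ~0.0015 under these assumed counts

With a much smaller benchmark, or a different regression/gain split, the same 8pp delta can fail to reach significance, which is exactly why the referee asks for the task count and variance to be reported.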

Circularity Check

0 steps flagged

Empirical fine-tuning results contain no circular derivation chain

full rationale

The paper describes a two-stage fine-tuning procedure on Qwen2.5-VL-32B for single-click web tasks followed by direct evaluation on a custom benchmark, reporting measured success-rate improvement (86% to 94%). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim is an observed empirical delta on held-out tasks, not a quantity forced by construction from the training data or prior self-references. This is a standard non-circular empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work implicitly relies on standard supervised fine-tuning assumptions and the validity of the unreported custom benchmark.

pith-pipeline@v0.9.0 · 5567 in / 1091 out tokens · 26804 ms · 2026-05-15T20:47:35.716931+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    Minttu Alakuijala, Ya Gao, Georgy Ananov, Samuel Kaski, Pekka Marttinen, Alexander Ilin, and Harri Valpola. 2025. Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization. arXiv:2502.01562 [cs.LG] https://arxiv.org/abs/2502.01562

  2. [2]

    Tanvir Bhathal and Asanshay Gupta. 2025. Websight: A vision-first architecture for robust web agents. arXiv:2508.16987 [cs.AI] https://arxiv.org/abs/2508.16987

  3. [3]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Bangkok, Thailand, 9313–9332. https://aclanthology.org/2...

  4. [4]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070 [cs.CL] https://arxiv.org/abs/2306.06070

  5. [5]

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv:2401.13919 [cs.CL] https://arxiv.org/abs/2401.13919

  6. [6]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  7. [7]

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried

  8. [8]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. arXiv:2401.13649 [cs.CL] https://arxiv.org/abs/2401.13649

  9. [9]

    Kalle Kujanpää, Pekka Marttinen, Harri Valpola, and Alexander Ilin. 2025. Efficient Knowledge Injection in LLMs via Self-Distillation. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=drYpdSnRJk

  10. [10]

    Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, and Qi Wang. 2024. ScreenAgent: A Vision Language Model-driven Computer Control Agent. arXiv:2402.07945 [cs.HC] https://arxiv.org/abs/2402.07945

  11. [11]

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. 2025. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv:2501.12326 [cs.AI] https://arxiv.org/abs/2501.12326

  12. [12]

    Charlie Snell, Dan Klein, and Ruiqi Zhong. 2022. Learning by Distilling Context. arXiv:2209.15189 [cs.CL] https://arxiv.org/abs/2209.15189

  13. [13]

    Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/

  14. [14]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations.

  15. [15]

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. GPT-4V(ision) is a Generalist Web Agent, if Grounded. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=piecKJ2DlB

  16. [16]

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. In Proceedings of the International Conference on Learning Representations. https://webarena.dev