pith. machine review for the scientific record.

arxiv: 2604.13019 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

Gaurav Mittal, Himangi Mittal, Nelson Daniel Troncoso, Yu Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI grounding · multi-turn refinement · visual feedback · computer use agents · coding interfaces · click precision · iterative reasoning · software engineering agents

The pith

Multi-turn refinement using visual feedback from prior attempts achieves higher click precision and task success in GUI grounding for dense coding interfaces than single-shot prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that computer-use agents can locate and interact with tiny screen elements more reliably by trying, seeing the result, and adjusting, rather than by guessing the coordinates once. In crowded coding interfaces where single predictions often miss by a few pixels, the iterative loop lets the model correct its own displacement errors and handle shifting UI elements. Tests across GPT-5.4, Claude, and Qwen on coding benchmarks report clear gains in both exact click accuracy and end-to-end task completion. Readers should care because most current agents still fail at the basic step of clicking the right pixel in real software tools.
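
The loop itself is not spelled out in the text reproduced here, so the sketch below is only an illustration of the pattern: propose a click, act, take a fresh screenshot, and feed the outcome back as context for the next attempt. The callable parameters (take_screenshot, propose_click, perform_click, on_target) are hypothetical stand-ins, not APIs from the paper or its repository.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Click:
    x: int
    y: int

def refine_click(
    instruction: str,
    take_screenshot: Callable[[], bytes],                 # current screen as an image
    propose_click: Callable[[str, bytes, list], Click],   # VLM call: instruction + frame + history -> click
    perform_click: Callable[[Click], None],                # execute the click in the environment
    on_target: Callable[[bytes], bool],                    # did the last click land, e.g. element now focused?
    max_turns: int = 3,
) -> Click:
    """Closed-loop grounding sketch: act, observe the outcome, re-aim."""
    history: List[Tuple[Click, bytes]] = []
    click = Click(0, 0)
    for _ in range(max_turns):
        frame = take_screenshot()                   # fresh frame, so dynamic UI changes stay visible
        click = propose_click(instruction, frame, history)
        perform_click(click)
        outcome = take_screenshot()                 # the visual feedback from this attempt
        history.append((click, outcome))            # carried into the next turn's context
        if on_target(outcome):
            break                                   # stop early once the click lands
    return click

Setting max_turns=1 collapses this to the single-shot baseline the paper compares against, which is what makes the ablation described under "What would settle it" below straightforward to run.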

Core claim

Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate.

What carries the argument

The closed-loop grounding mechanism that feeds visual results from prior cursor placements back into the model for iterative position refinement.
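
How the prior attempt is surfaced to the model is not described in the text above; one plausible rendering, assumed here purely for illustration, is to overlay the previous click position on the fresh screenshot before the next turn, so the model can see exactly where its last attempt landed.

from PIL import Image, ImageDraw

def mark_previous_click(screenshot: Image.Image, x: int, y: int, r: int = 8) -> Image.Image:
    """Annotate the new screenshot with the position of the previous click attempt."""
    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)
    draw.ellipse((x - r, y - r, x + r, y + r), outline="red", width=2)  # circle the last attempt
    draw.line((x - r, y, x + r, y), fill="red", width=1)                # crosshair arms for pixel-level reading
    draw.line((x, y - r, x, y + r), fill="red", width=1)
    return annotated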

If this is right

  • Click precision rises in high-density coding interfaces where single predictions routinely fail.
  • Overall task success rates increase for software engineering benchmarks.
  • The agent can adapt to dynamic UI changes without retraining.
  • Displacement errors from initial predictions are reduced through self-correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same loop could be applied to other dense graphical interfaces such as design tools or data-visualization software.
  • Iteration may let general-purpose models reach usable reliability without task-specific fine-tuning on every interface.
  • Combining visual feedback with other signals like text logs could further stabilize agent behavior on long workflows.

Load-bearing premise

The model can correctly read the visual outcome of its last click and then produce a better next click without adding new mistakes or losing track of changes in the interface.

What would settle it

Running the same dense-IDE click tasks with and without visual feedback under a matched interaction budget: if multi-turn accuracy stays the same or drops, the central claim fails.
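
A minimal harness for that comparison, assuming hypothetical tasks, a grounding agent ground that accepts a turn budget, and a success criterion hit (none of these names come from the paper or its benchmark):

from typing import Callable, Iterable

def feedback_ablation(tasks: Iterable, ground: Callable, hit: Callable, k: int = 3) -> dict:
    """Run identical click tasks single-shot (max_turns=1) and multi-turn (max_turns=k)."""
    counts = {"single_shot": 0, "multi_turn": 0, "n": 0}
    for task in tasks:
        counts["n"] += 1
        if hit(task, ground(task, max_turns=1)):    # no visual feedback possible
            counts["single_shot"] += 1
        if hit(task, ground(task, max_turns=k)):    # closed-loop refinement enabled
            counts["multi_turn"] += 1
    return counts

If multi_turn fails to exceed single_shot under a matched interaction budget, the feedback mechanism is not doing the work the paper attributes to it.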

Figures

Figures reproduced from arXiv: 2604.13019 by Gaurav Mittal, Himangi Mittal, Nelson Daniel Troncoso, Yu Hu.

Figure 1. Data collection system overview and data flow. The pipeline utilizes a process-separated architecture to bridge … [caption truncated at source; figure not reproduced here]
original abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing single-shot GUI grounding methods fail in dense coding interfaces requiring sub-pixel accuracy, and proposes a multi-turn 'See, Point, Refine' iterative process that uses visual feedback from prior attempts to self-correct cursor localization errors. It evaluates this closed-loop approach against single-shot baselines on GPT-5.4, Claude, and Qwen using complex coding benchmarks, asserting significant gains in click precision and overall task success rate, and concludes that iterative visual reasoning is essential for reliable software engineering agents.

Significance. If the reported outperformance is robustly demonstrated, the work would provide concrete evidence that closed-loop visual feedback improves reliability in high-density GUI tasks, with direct implications for computer-use agents in software engineering. The empirical focus on coding environments and provision of a code repository are positive, but the absence of detailed methods and results in the manuscript as presented substantially weakens the ability to assess whether the central claim holds.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation description: the central claim of 'significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate' is asserted without any specification of the click-precision metric (e.g., pixel-error threshold or success criterion), the exact baselines used, number of trials or tasks, statistical tests, or controls for interaction budget and number of refinement turns. These omissions are load-bearing because the superiority result cannot be evaluated or reproduced from the given information.
  2. [Approach] Methods description: the iterative refinement process is described at a high level ('utilizing visual feedback from previous attempts') but lacks concrete details on prompt construction for feedback incorporation, termination criteria, handling of dynamic UI changes, or how the multi-turn budget is allocated across models. Without these, it is impossible to determine whether the reported gains arise from the proposed mechanism or from uncontrolled differences in total compute or prompting.
minor comments (2)
  1. [Abstract] The abstract refers to 'a suite of complex coding benchmarks' without naming them or providing a citation; this should be expanded for clarity even if details appear later.
  2. [Abstract] The GitHub link is provided but the manuscript does not indicate whether the benchmark tasks, prompts, or evaluation scripts are included in the repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript to improve clarity and reproducibility.

point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation description: the central claim of 'significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate' is asserted without any specification of the click-precision metric (e.g., pixel-error threshold or success criterion), the exact baselines used, number of trials or tasks, statistical tests, or controls for interaction budget and number of refinement turns. These omissions are load-bearing because the superiority result cannot be evaluated or reproduced from the given information.

    Authors: We agree that the abstract and evaluation description in the current manuscript do not provide these specifications, which limits the ability to assess and reproduce the central claim. The experimental details exist in our full evaluation protocol and linked code repository, but they are not sufficiently summarized in the text. We will revise the abstract to briefly note the click-precision metric, success criterion, and key controls, and we will expand the evaluation section to explicitly list the baselines, number of tasks and trials, statistical tests performed, and how interaction budgets and refinement turns were controlled across conditions. This will make the superiority result evaluable directly from the manuscript. revision: yes

  2. Referee: [Approach] Methods description: the iterative refinement process is described at a high level ('utilizing visual feedback from previous attempts') but lacks concrete details on prompt construction for feedback incorporation, termination criteria, handling of dynamic UI changes, or how the multi-turn budget is allocated across models. Without these, it is impossible to determine whether the reported gains arise from the proposed mechanism or from uncontrolled differences in total compute or prompting.

    Authors: We agree that the current high-level description of the iterative process does not include the requested concrete details, which is necessary to isolate the contribution of the visual feedback mechanism. We will revise the approach section to add a detailed description of prompt construction for incorporating prior visual feedback, the termination criteria used, how dynamic UI changes are handled via fresh screenshots, and the allocation of the multi-turn budget (including how it is matched to single-shot baselines). These additions will clarify that performance differences are attributable to the closed-loop refinement rather than variations in total compute or prompting strategy. revision: yes
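
Neither the report nor the rebuttal pins down the click-precision criterion. Two conventions commonly used for GUI grounding are sketched below as assumptions, a hit-inside-bounding-box rule and a pixel-radius rule; which one (if either) the evaluation actually uses is not stated in the text above.

import math

def hit_in_bbox(x: int, y: int, bbox: tuple) -> bool:
    """Hit if the predicted click falls inside the target element's (left, top, right, bottom) box."""
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def hit_within_radius(x: int, y: int, cx: int, cy: int, max_px: float = 5.0) -> bool:
    """Hit if the predicted click is within max_px pixels of the target centre."""
    return math.hypot(x - cx, y - cy) <= max_px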

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical technical report describing an iterative visual refinement process for GUI grounding in coding interfaces. It compares multi-turn agent performance against single-shot baselines across GPT-5.4, Claude, and Qwen on coding benchmarks, reporting higher click precision and task success. No equations, derivations, parameter fittings, or self-referential definitions appear in the provided text. The central claim rests on direct experimental comparison rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed known results. The evaluation design is externally falsifiable via the linked benchmark and code repository.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work is presented as an empirical study without mathematical derivations or new postulated components.

pith-pipeline@v0.9.0 · 5515 in / 1053 out tokens · 41951 ms · 2026-05-10T15:54:16.329493+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

17 extracted references · 10 canonical work pages

  1. [1]

    Agent S2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025

  2. [2]

    Gui-eyes: Tool-augmented perception for visual grounding in gui agents. arXiv preprint arXiv:2601.09770, 2026

    Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, and Wu Liu. Gui-eyes: Tool-augmented perception for visual grounding in gui agents. arXiv preprint arXiv:2601.09770, 2026

  3. [3]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents, 2024. URL https://arxiv.org/abs/2401.10935

  4. [4]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36: 28091--28114, 2023

  5. [5]

    Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243, 2024

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243, 2024

  6. [6]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864--6890, 2024

  7. [7]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281--14290, 2024

  8. [8]

    A data-driven approach for learning to control computers

    Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pages 9466--9482. PMLR, 2022

  9. [9]

    Screenspot-pro: Gui grounding for professional high-resolution computer use. arXiv, abs/2504.07981, 2025

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025. URL https://arxiv.org/abs/2504.07981

  10. [10]

    Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025

  11. [11]

    World of bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135--3144. PMLR, 2017

  12. [12]

    GUI-G2: Gaussian reward modeling for gui grounding. arXiv preprint arXiv:2507.15846, 2025

    Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G2: Gaussian reward modeling for gui grounding. arXiv preprint arXiv:2507.15846, 2025

  13. [13]

    Opencua: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025

  14. [14]

    Gui-actor: Coordinate-free visual grounding for gui agents. arXiv preprint arXiv:2506.03143, 2025

    Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents. arXiv preprint arXiv:2506.03143, 2025

  15. [15]

    java21" shown on the file path of the file manager. Text 1 between text Click once at the position before

    Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025

  16. [16]

    Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents

    Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li. Tongui: Internet-scale trajectories from multimodal web tutorials for generalized gui agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12367--12375, 2026

  17. [17]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1--20, 2025