See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3
The pith
In GUI grounding for dense coding interfaces, multi-turn refinement that uses visual feedback from prior attempts achieves higher click precision and task success than single-shot prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate.
What carries the argument
The closed-loop grounding mechanism that feeds visual results from prior cursor placements back into the model for iterative position refinement.
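The control flow of that loop can be made concrete. The sketch below is a minimal illustration, not the paper's implementation: `ToyGrounder` is a stand-in for the vision-language model, and `predict_click` / `predict_correction` are assumed interfaces invented here to show the see, point, refine cycle.

```python
from dataclasses import dataclass


@dataclass
class ToyGrounder:
    """Hypothetical stand-in for a VLM grounder: a displaced single-shot
    guess, then corrections that step toward the target as if the model
    had read the rendered cursor position from the screenshot."""
    target: tuple  # ground-truth element center in pixels

    def predict_click(self, instruction, screenshot):
        tx, ty = self.target
        return tx + 18, ty - 12  # typical single-shot displacement error

    def predict_correction(self, instruction, screenshot, cursor):
        x, y = cursor
        tx, ty = self.target
        # "Read" the cursor overlay and move roughly halfway to the target.
        return (tx - x) // 2, (ty - y) // 2


def refine_click(model, instruction, screenshot, max_turns=5):
    """See, point, refine: iterate until the model reports no correction."""
    x, y = model.predict_click(instruction, screenshot)  # point
    for _ in range(max_turns):
        dx, dy = model.predict_correction(instruction, screenshot, (x, y))  # see
        if dx == 0 and dy == 0:  # model judges the cursor on target
            break
        x, y = x + dx, y + dy  # refine
    return x, y
```

Under this toy dynamic, an initial miss of roughly 20 px shrinks to within a pixel or two after a handful of turns, which is the behavior the closed-loop mechanism is claimed to deliver on real interfaces.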
If this is right
- Click precision rises in high-density coding interfaces where single predictions routinely fail.
- Overall task success rates increase for software engineering benchmarks.
- The agent can adapt to dynamic UI changes without retraining.
- Displacement errors from initial predictions are reduced through self-correction.
Where Pith is reading between the lines
- The same loop could be applied to other dense graphical interfaces such as design tools or data-visualization software.
- Iteration may let general-purpose models reach usable reliability without task-specific fine-tuning on every interface.
- Combining visual feedback with other signals like text logs could further stabilize agent behavior on long workflows.
Load-bearing premise
The model can correctly read the visual outcome of its last click and then produce a better next click without adding new mistakes or losing track of changes in the interface.
What would settle it
An ablation that runs the same dense-IDE click tasks with and without visual feedback: if multi-turn accuracy stays flat or drops, the core claim fails; if it rises, the claim holds.
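One way such an ablation could be scored (a sketch under an assumed setup, not from the paper): record per-task success under both conditions and bootstrap the paired difference in success rate. An interval that includes zero or negative values would count against the claim.

```python
import random


def bootstrap_diff_ci(with_fb, without_fb, n_boot=10_000, seed=0):
    """95% bootstrap CI for the difference in per-task success rate
    between two conditions evaluated on the same tasks (1 = success).

    `with_fb` / `without_fb` are hypothetical per-task outcome lists,
    e.g. from the with- and without-visual-feedback conditions.
    """
    assert len(with_fb) == len(without_fb)
    rng = random.Random(seed)
    n = len(with_fb)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks with replacement
        diffs.append(sum(with_fb[i] - without_fb[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

Pairing by task matters here: the two conditions see identical targets, so resampling tasks jointly removes between-task difficulty as a confound.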
Original abstract
Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing single-shot GUI grounding methods fail in dense coding interfaces requiring sub-pixel accuracy, and proposes a multi-turn 'See, Point, Refine' iterative process that uses visual feedback from prior attempts to self-correct cursor localization errors. It evaluates this closed-loop approach against single-shot baselines on GPT-5.4, Claude, and Qwen using complex coding benchmarks, asserting significant gains in click precision and overall task success rate, and concludes that iterative visual reasoning is essential for reliable software engineering agents.
Significance. If the reported outperformance is robustly demonstrated, the work would provide concrete evidence that closed-loop visual feedback improves reliability in high-density GUI tasks, with direct implications for computer-use agents in software engineering. The empirical focus on coding environments and provision of a code repository are positive, but the absence of detailed methods and results in the manuscript as presented substantially weakens the ability to assess whether the central claim holds.
Major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: the central claim of 'significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate' is asserted without any specification of the click-precision metric (e.g., pixel-error threshold or success criterion), the exact baselines used, number of trials or tasks, statistical tests, or controls for interaction budget and number of refinement turns. These omissions are load-bearing because the superiority result cannot be evaluated or reproduced from the given information.
- [Approach] Methods description: the iterative refinement process is described at a high level ('utilizing visual feedback from previous attempts') but lacks concrete details on prompt construction for feedback incorporation, termination criteria, handling of dynamic UI changes, or how the multi-turn budget is allocated across models. Without these, it is impossible to determine whether the reported gains arise from the proposed mechanism or from uncontrolled differences in total compute or prompting.
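For concreteness, one common formalization of click precision in the GUI-grounding literature (not necessarily the metric the paper used; the bounding-box hit criterion and radius fallback below are assumptions) looks like:

```python
def click_precision(clicks, targets, radius=None):
    """Fraction of clicks that hit their target.

    `targets` are (x0, y0, x1, y1) bounding boxes of the intended
    elements; if `radius` is given, a click also counts when it lands
    within `radius` px of the box center, a looser criterion sometimes
    used for tiny IDE elements such as gutter icons.
    """
    hits = 0
    for (cx, cy), (x0, y0, x1, y1) in zip(clicks, targets):
        in_box = x0 <= cx <= x1 and y0 <= cy <= y1
        near_center = False
        if radius is not None:
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            near_center = (cx - mx) ** 2 + (cy - my) ** 2 <= radius ** 2
        hits += in_box or near_center
    return hits / len(clicks)
```

Whichever variant the authors chose, reporting the threshold explicitly is what makes the headline comparison reproducible.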
Minor comments (2)
- [Abstract] The abstract refers to 'a suite of complex coding benchmarks' without naming them or providing a citation; this should be expanded for clarity even if details appear later.
- [Abstract] The GitHub link is provided but the manuscript does not indicate whether the benchmark tasks, prompts, or evaluation scripts are included in the repository.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript to improve clarity and reproducibility.
Point-by-point responses
Referee: [Abstract / Evaluation] Abstract and evaluation description: the central claim of 'significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate' is asserted without any specification of the click-precision metric (e.g., pixel-error threshold or success criterion), the exact baselines used, number of trials or tasks, statistical tests, or controls for interaction budget and number of refinement turns. These omissions are load-bearing because the superiority result cannot be evaluated or reproduced from the given information.
Authors: We agree that the abstract and evaluation description in the current manuscript do not provide these specifications, which limits the ability to assess and reproduce the central claim. The experimental details exist in our full evaluation protocol and linked code repository, but they are not sufficiently summarized in the text. We will revise the abstract to briefly note the click-precision metric, success criterion, and key controls, and we will expand the evaluation section to explicitly list the baselines, number of tasks and trials, statistical tests performed, and how interaction budgets and refinement turns were controlled across conditions. This will make the superiority result evaluable directly from the manuscript. revision: yes
Referee: [Approach] Methods description: the iterative refinement process is described at a high level ('utilizing visual feedback from previous attempts') but lacks concrete details on prompt construction for feedback incorporation, termination criteria, handling of dynamic UI changes, or how the multi-turn budget is allocated across models. Without these, it is impossible to determine whether the reported gains arise from the proposed mechanism or from uncontrolled differences in total compute or prompting.
Authors: We agree that the current high-level description of the iterative process does not include the requested concrete details, which is necessary to isolate the contribution of the visual feedback mechanism. We will revise the approach section to add a detailed description of prompt construction for incorporating prior visual feedback, the termination criteria used, how dynamic UI changes are handled via fresh screenshots, and the allocation of the multi-turn budget (including how it is matched to single-shot baselines). These additions will clarify that performance differences are attributable to the closed-loop refinement rather than variations in total compute or prompting strategy. revision: yes
Circularity Check
No significant circularity
Full rationale
The paper is an empirical technical report describing an iterative visual refinement process for GUI grounding in coding interfaces. It compares multi-turn agent performance against single-shot baselines across GPT-5.4, Claude, and Qwen on coding benchmarks, reporting higher click precision and task success. No equations, derivations, parameter fittings, or self-referential definitions appear in the provided text. The central claim rests on direct experimental comparison rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed known results. The evaluation design is externally falsifiable via the linked benchmark and code repository.
Reference graph
Works this paper leans on
- [1] Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S2: A compositional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906, 2025.
- [2] Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, and Wu Liu. GUI-Eyes: Tool-augmented perception for visual grounding in GUI agents. arXiv preprint arXiv:2601.09770, 2026.
- [3] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents, 2024. URL https://arxiv.org/abs/2401.10935.
- [4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091--28114, 2023.
- [5] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. arXiv preprint arXiv:2410.05243, 2024.
- [6] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864--6890, 2024.
- [7] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281--14290, 2024.
- [8] Peter C. Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pages 9466--9482. PMLR, 2022.
- [9] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use, 2025. URL https://arxiv.org/abs/2504.07981.
- [10] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025.
- [11] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135--3144. PMLR, 2017.
- [12] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G2: Gaussian reward modeling for GUI grounding. arXiv preprint arXiv:2507.15846, 2025.
- [13] Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123, 2025.
- [14] Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. GUI-Actor: Coordinate-free visual grounding for GUI agents. arXiv preprint arXiv:2506.03143, 2025.
- [15] Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025.
- [16] Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li. TongUI: Internet-scale trajectories from multimodal web tutorials for generalized GUI agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12367--12375, 2026.
- [17] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1--20, 2025.