PrecisionCUA: Iterative Visual Refinement for Pixel-Precise Cursor Grounding in Code Editors
Pith reviewed 2026-05-10 15:54 UTC · model grok-4.3
The pith
Multi-turn refinement using visual feedback from prior attempts achieves higher click precision and task success in GUI grounding for dense coding interfaces than single-shot prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate.
What carries the argument
The closed-loop grounding mechanism that feeds visual results from prior cursor placements back into the model for iterative position refinement.
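The abstract gives no implementation details for this loop, so the following is only a minimal sketch of what closed-loop refinement could look like. Every name in it (`Box`, `refine`, `toy_proposer`) is hypothetical. The "model" is a toy that moves halfway toward the target on each retry, and the success check is an oracle; a real agent would have to judge the outcome from a fresh screenshot.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in screen pixels (edges inclusive)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def center(self) -> Point:
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def refine(propose: Callable[[List[Point]], Point],
           on_target: Callable[[Point], bool],
           max_turns: int = 5) -> Tuple[Point, int, bool]:
    """Propose a point, observe the outcome, and retry with the history."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    history: List[Point] = []
    for turn in range(1, max_turns + 1):
        # Assumption: a real agent would also receive a fresh screenshot here.
        guess = propose(history)
        history.append(guess)
        # Assumption: oracle check; the paper's agent has no such oracle.
        if on_target(guess):
            return guess, turn, True
    return history[-1], max_turns, False


def toy_proposer(target: Box, first_guess: Point) -> Callable[[List[Point]], Point]:
    """Stand-in for a model: each retry moves halfway toward the target center."""
    cx, cy = target.center()

    def propose(history: List[Point]) -> Point:
        if not history:
            return first_guess
        px, py = history[-1]
        return ((px + cx) // 2, (py + cy) // 2)

    return propose


target = Box(100, 50, 108, 58)
propose = toy_proposer(target, first_guess=(120, 70))
print(refine(propose, target.contains, max_turns=1))  # ((120, 70), 1, False)
print(refine(propose, target.contains, max_turns=5))  # ((108, 58), 3, True)
```

With a budget of one turn the toy misses; with five it lands on the third attempt. The sketch only illustrates the control flow. Whether a real model's retries converge like this is what the paper's experiments would need to show.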
If this is right
- Click precision rises in high-density coding interfaces where single predictions routinely fail.
- Overall task success rates increase for software engineering benchmarks.
- The agent can adapt to dynamic UI changes without retraining.
- Displacement errors from initial predictions are reduced through self-correction.
Where Pith is reading between the lines
- The same loop could be applied to other dense graphical interfaces such as design tools or data-visualization software.
- Iteration may let general-purpose models reach usable reliability without task-specific fine-tuning on every interface.
- Combining visual feedback with other signals like text logs could further stabilize agent behavior on long workflows.
Load-bearing premise
The model can correctly read the visual outcome of its last click and then produce a better next click without adding new mistakes or losing track of changes in the interface.
What would settle it
Running the same dense-IDE click tasks with and without visual feedback and finding that multi-turn accuracy stays the same or drops.
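One way such a paired comparison could be scored is an exact McNemar test on the tasks where the two conditions disagree. This is an illustrative sketch, not the paper's protocol: the function name `mcnemar_exact` is hypothetical and the outcome lists are made-up toy data, not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(single_hits: Sequence[bool],
                  multi_hits: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-task outcomes.

    Returns (b, c, p) where b counts tasks only the single-shot run solved,
    c counts tasks only the multi-turn run solved, and p is the p-value
    under the null hypothesis that both discordant outcomes are equally likely.
    """
    if len(single_hits) != len(multi_hits):
        raise ValueError("outcome lists must cover the same tasks")
    b = sum(1 for s, m in zip(single_hits, multi_hits) if s and not m)
    c = sum(1 for s, m in zip(single_hits, multi_hits) if m and not s)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no disagreements, so no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (invented for illustration): eight tasks, one bool per task.
single = [False, False, False, False, False, True, True, True]
multi = [True, True, True, True, True, True, True, False]
print(mcnemar_exact(single, multi))  # (1, 5, 0.21875)
```

In this toy case multi-turn wins five of the six discordant tasks, yet the p-value of about 0.22 is not significant. A claim of significant improvement therefore depends on how many tasks and trials were run.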
Original abstract
Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces (such as VS Code and Cursor), where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across Claude, Qwen, and GPT on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench/tree/main.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing single-shot GUI grounding methods fail in dense coding interfaces requiring sub-pixel accuracy, and proposes a multi-turn 'See, Point, Refine' iterative process that uses visual feedback from prior attempts to self-correct cursor localization errors. It evaluates this closed-loop approach against single-shot baselines on GPT-5.4, Claude, and Qwen using complex coding benchmarks, asserting significant gains in click precision and overall task success rate, and concludes that iterative visual reasoning is essential for reliable software engineering agents.
Significance. If the reported outperformance is robustly demonstrated, the work would provide concrete evidence that closed-loop visual feedback improves reliability in high-density GUI tasks, with direct implications for computer-use agents in software engineering. The empirical focus on coding environments and provision of a code repository are positive, but the absence of detailed methods and results in the manuscript as presented substantially weakens the ability to assess whether the central claim holds.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: the central claim of 'significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate' is asserted without any specification of the click-precision metric (e.g., pixel-error threshold or success criterion), the exact baselines used, number of trials or tasks, statistical tests, or controls for interaction budget and number of refinement turns. These omissions are load-bearing because the superiority result cannot be evaluated or reproduced from the given information.
- [Approach] Methods description: the iterative refinement process is described at a high level ('utilizing visual feedback from previous attempts') but lacks concrete details on prompt construction for feedback incorporation, termination criteria, handling of dynamic UI changes, or how the multi-turn budget is allocated across models. Without these, it is impossible to determine whether the reported gains arise from the proposed mechanism or from uncontrolled differences in total compute or prompting.
minor comments (2)
- [Abstract] The abstract refers to 'a suite of complex coding benchmarks' without naming them or providing a citation; this should be expanded for clarity even if details appear later.
- [Abstract] The GitHub link is provided but the manuscript does not indicate whether the benchmark tasks, prompts, or evaluation scripts are included in the repository.
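The first major comment asks what "click precision" means. As a hedged illustration of one candidate definition (the manuscript does not state which it uses), the sketch below scores a click by its Euclidean distance to the target center and reports the fraction of clicks within a pixel threshold. The names `pixel_error` and `precision_at` and the sample coordinates are assumptions made for this example.

```python
from math import hypot
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pixel_error(click: Point, target_center: Point) -> float:
    """Euclidean distance in pixels between a click and the target center."""
    return hypot(click[0] - target_center[0], click[1] - target_center[1])


def precision_at(clicks: Sequence[Point],
                 target_centers: Sequence[Point],
                 threshold_px: float) -> float:
    """Fraction of clicks whose pixel error is at most threshold_px."""
    if not clicks or len(clicks) != len(target_centers):
        raise ValueError("need equal, non-empty lists of clicks and targets")
    hits = sum(
        1 for click, center in zip(clicks, target_centers)
        if pixel_error(click, center) <= threshold_px
    )
    return hits / len(clicks)


# Invented sample clicks, all aimed at a target centered on (10, 10).
clicks = [(10, 10), (13, 14), (30, 30)]
centers = [(10, 10)] * 3
print(precision_at(clicks, centers, threshold_px=5))  # 2 of 3 clicks count
print(precision_at(clicks, centers, threshold_px=4))  # 1 of 3 clicks counts
```

The same clicks score 2/3 at a 5-pixel threshold and 1/3 at 4 pixels, which is why the referee treats the unstated threshold as load-bearing.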
Simulated Authors' Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to the manuscript to improve clarity and reproducibility.
- Referee: [Abstract / Evaluation] Abstract and evaluation description: the central claim of 'significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate' is asserted without any specification of the click-precision metric (e.g., pixel-error threshold or success criterion), the exact baselines used, number of trials or tasks, statistical tests, or controls for interaction budget and number of refinement turns. These omissions are load-bearing because the superiority result cannot be evaluated or reproduced from the given information.
  Authors: We agree that the abstract and evaluation description in the current manuscript do not provide these specifications, which limits the ability to assess and reproduce the central claim. The experimental details exist in our full evaluation protocol and linked code repository, but they are not sufficiently summarized in the text. We will revise the abstract to briefly note the click-precision metric, success criterion, and key controls, and we will expand the evaluation section to explicitly list the baselines, number of tasks and trials, statistical tests performed, and how interaction budgets and refinement turns were controlled across conditions. This will make the superiority result evaluable directly from the manuscript.
  revision: yes
- Referee: [Approach] Methods description: the iterative refinement process is described at a high level ('utilizing visual feedback from previous attempts') but lacks concrete details on prompt construction for feedback incorporation, termination criteria, handling of dynamic UI changes, or how the multi-turn budget is allocated across models. Without these, it is impossible to determine whether the reported gains arise from the proposed mechanism or from uncontrolled differences in total compute or prompting.
  Authors: We agree that the current high-level description of the iterative process omits the requested concrete details, which are necessary to isolate the contribution of the visual feedback mechanism. We will revise the approach section to add a detailed description of prompt construction for incorporating prior visual feedback, the termination criteria used, how dynamic UI changes are handled via fresh screenshots, and the allocation of the multi-turn budget (including how it is matched to single-shot baselines). These additions will clarify that performance differences are attributable to the closed-loop refinement rather than to variations in total compute or prompting strategy.
  revision: yes
Circularity Check
No significant circularity
The paper is an empirical technical report describing an iterative visual refinement process for GUI grounding in coding interfaces. It compares multi-turn agent performance against single-shot baselines across GPT-5.4, Claude, and Qwen on coding benchmarks, reporting higher click precision and task success. No equations, derivations, parameter fittings, or self-referential definitions appear in the provided text. The central claim rests on direct experimental comparison rather than any reduction of outputs to inputs by construction, self-citation chains, or renamed known results. The evaluation design is externally falsifiable via the linked benchmark and code repository.