arxiv: 2604.14113 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Pith reviewed 2026-05-10 14:02 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords GUI grounding · uncertainty quantification · adaptive zoom-in · test-time inference · visual localization · confidence estimation · screenshot analysis · no-training adaptation

The pith

Uncertainty signals decide when and how much to zoom into GUI screenshots for better element localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that treating both the decision to zoom and the crop size as problems of quantifying prediction uncertainty allows a model to improve GUI grounding accuracy without any retraining. A sympathetic reader would care because uniform zoom-in methods either apply costly high-resolution inference to every case or fail to help precisely where small icons and dense layouts cause errors. The method combines two signals the model already produces, agreement among stochastically sampled candidates and token-level generation confidence, to trigger zoom only on uncertain predictions and then to size the crop to the observed variance. If correct, this means test-time adaptation can become more efficient by spending extra compute only where it is likely to reduce localization mistakes.

Core claim

UI-Zoomer is a training-free framework that treats the trigger and the scale of zoom-in as an uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic forward passes with token-level generation confidence to decide whether localization is uncertain enough to warrant zooming. When triggered, an uncertainty-driven crop sizing module decomposes total prediction variance into inter-sample positional spread and intra-sample box extent, then derives a per-instance crop radius from the law of total variance. Experiments show this selective approach yields consistent gains over strong baselines on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 across several model architectures.
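
A minimal sketch of how such a gate could be wired up, for orientation only. The abstract does not give the consensus measure, the confidence measure, or the fusion rule, so everything concrete here is an assumption: mean pairwise IoU as the consensus signal, geometric-mean token probability as the confidence signal, a product as the fusion, and the `threshold` parameter. All function names are hypothetical.

```python
from math import exp
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: Sequence[Box]) -> float:
    """Mean pairwise IoU over the sampled boxes (assumed consensus measure)."""
    pairs = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
    if not pairs:
        return 1.0  # a single sample trivially agrees with itself
    return sum(iou(boxes[i], boxes[j]) for i, j in pairs) / len(pairs)


def token_confidence(logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the generated tokens (assumed measure)."""
    return exp(sum(logprobs) / len(logprobs))


def should_zoom(boxes: Sequence[Box], logprobs: Sequence[float],
                threshold: float = 0.5) -> bool:
    """Hypothetical fusion: zoom when consensus * confidence falls below threshold."""
    score = spatial_consensus(boxes) * token_confidence(logprobs)
    return score < threshold
```

With identical sampled boxes and confident tokens the fused score is 1.0 and no zoom is triggered; disjoint boxes drive the score to 0 and trigger it.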

What carries the argument

The confidence-aware gate that fuses spatial consensus among stochastic candidates with token-level generation confidence, paired with the uncertainty-driven crop sizing module that applies the law of total variance to derive per-instance crop radii.
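
The crop-sizing half can be sketched numerically as follows. This is an illustration under stated assumptions, not the paper's formula: each sampled box is treated as a uniform distribution over its extent (so a side of length w contributes variance w²/12), the larger of the two axis variances sets the scale, and the multiplier `k` and floor `min_radius` are hypothetical parameters.

```python
from math import sqrt
from statistics import fmean, pvariance
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def total_variance(boxes: Sequence[Box], axis: int) -> float:
    """Inter-sample variance of centers plus mean intra-box variance on one axis.

    axis=0 is x, axis=1 is y. Assumes a uniform distribution inside each box.
    """
    lo, hi = axis, axis + 2
    centers = [(b[lo] + b[hi]) / 2 for b in boxes]
    inter = pvariance(centers)                           # Var of conditional means
    intra = fmean([(b[hi] - b[lo]) ** 2 / 12 for b in boxes])  # mean conditional Var
    return inter + intra


def crop_radius(boxes: Sequence[Box], k: float = 3.0,
                min_radius: float = 32.0) -> float:
    """Hypothetical per-instance radius: k standard deviations, floored."""
    sigma = sqrt(max(total_variance(boxes, 0), total_variance(boxes, 1)))
    return max(min_radius, k * sigma)
```

Tightly clustered small boxes give a small radius (bounded below by the floor), while scattered or large boxes widen the crop.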

If this is right

Consistent accuracy gains across ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 benchmarks.
Improvements hold for multiple underlying model architectures without retraining.
Zoom-in is applied only to uncertain instances, limiting extra compute to cases that need it.
Crop size is derived directly from observed prediction variance rather than fixed heuristics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same uncertainty signals could be reused to adapt other test-time decisions such as choosing between multiple candidate prompts or resolutions in vision-language tasks.
If the variance decomposition generalizes, similar uncertainty-driven cropping might improve localization in non-GUI domains such as medical imaging or autonomous driving scenes.
The approach opens a path toward fully training-free pipelines that chain several cheap uncertainty checks before committing to expensive high-resolution inference.

Load-bearing premise

The fused spatial-consensus and token-level confidence signals reliably identify cases where zoom-in will actually reduce localization error rather than introduce new errors or waste compute.

What would settle it

A collection of test cases in which the method triggers zoom-in but the re-inference at higher resolution produces strictly larger localization error than the original low-resolution prediction, or in which skipping zoom produces better accuracy than zooming.
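
One way to look for such cases is sketched below. It assumes an evaluation log that records, per instance, whether the gate fired and the localization error before and after zoom-in; the `Record` type and its field names are hypothetical, and obtaining `err_after` for instances where the gate did not fire would require forcing a zoom during evaluation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Record:
    triggered: bool     # did the gate fire on this instance?
    err_before: float   # error of the original low-resolution prediction
    err_after: float    # error after zoom-in re-inference (forced if not triggered)


def harmful_zooms(records: Sequence[Record]) -> List[Record]:
    """Instances where the gate fired and zooming made the error strictly worse."""
    return [r for r in records if r.triggered and r.err_after > r.err_before]


def missed_zooms(records: Sequence[Record]) -> List[Record]:
    """Instances where the gate stayed silent but a zoom would have helped."""
    return [r for r in records if not r.triggered and r.err_after < r.err_before]


def harmful_rate(records: Sequence[Record]) -> float:
    """Share of triggered instances on which zooming increased the error."""
    triggered = [r for r in records if r.triggered]
    if not triggered:
        return 0.0
    return len(harmful_zooms(records)) / len(triggered)
```

A high harmful rate, or a large set of missed zooms, would count against the load-bearing premise.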

Original abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
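
The crop-and-re-run step that all zoom-in methods share comes down to two pieces of coordinate bookkeeping, sketched here under assumptions. The square window, the clamping to image bounds, and both function names are illustrative choices, not details taken from the paper.

```python
from typing import Tuple

Window = Tuple[int, int, int, int]  # (left, top, right, bottom) in screenshot pixels


def crop_window(center: Tuple[float, float], radius: float,
                image_size: Tuple[int, int]) -> Window:
    """Square window of the given radius around center, clamped to the image."""
    cx, cy = center
    width, height = image_size
    left = max(0, int(cx - radius))
    top = max(0, int(cy - radius))
    right = min(width, int(cx + radius))
    bottom = min(height, int(cy + radius))
    return left, top, right, bottom


def to_full_image(point_in_crop: Tuple[float, float], window: Window,
                  crop_render_size: Tuple[int, int]) -> Tuple[float, float]:
    """Map a point predicted on the resized crop back to screenshot coordinates."""
    left, top, right, bottom = window
    render_w, render_h = crop_render_size
    scale_x = (right - left) / render_w   # screenshot pixels per rendered pixel
    scale_y = (bottom - top) / render_h
    px, py = point_in_crop
    return left + px * scale_x, top + py * scale_y
```

For example, a radius-40 window around (100, 50) rendered at 800×800 maps the crop's center pixel (400, 400) back to (100, 50) on the screenshot.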

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UI-Zoomer makes zoom-in adaptive via fused uncertainty signals and variance-based sizing, but the abstract leaves the gate's reliability unproven.

The core contribution is a training-free way to decide per-instance whether to zoom and how far, by fusing stochastic spatial consensus with token-level confidence for the trigger and then decomposing prediction variance into positional spread and box extent to set the crop radius. This moves beyond the uniform fixed-size zooms in prior test-time work and targets the real pain points of small icons and dense layouts in GUI grounding. The approach is clean on paper: no extra training, no free parameters, and a direct use of the model's own outputs for both decisions. If the full experiments confirm that the gate fires mostly on hard cases and that the sized crops reduce error there, this could be a simple lever for screen agents and multimodal interfaces. The reported gains across ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 on multiple backbones are the kind of numbers that matter for deployment.

The main weakness is the missing link between the uncertainty signals and actual error reduction. The abstract gives aggregate improvements but no per-instance breakdown, no ablation of the two uncertainty sources, no trigger-rate statistics, and no conditioned results showing zoom helps precisely when the gate triggers. Without those, it is hard to rule out that the gains come mostly from extra zoom compute rather than smart selection. The law-of-total-variance step is formally fine, but its practical payoff depends on the correlation the stress-test flags.

This is for people building or evaluating GUI agents who already run inference at test time and want a lightweight accuracy boost. It deserves peer review because the idea is implementable and the benchmarks are standard, even if the validation needs tightening on the causal role of the gate.

Referee Report

2 major / 1 minor

Summary. The paper proposes UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding that treats both the trigger decision and crop scale as uncertainty quantification problems. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to trigger zoom-in selectively on uncertain cases; when triggered, crop radius is derived by decomposing prediction variance into inter-sample positional spread and intra-sample box extent via the law of total variance. Experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 report consistent gains (up to +13.4%, +10.3%, +4.2%) over strong baselines across multiple model architectures without additional training.

Significance. If the uncertainty signals reliably identify cases where zoom-in reduces localization error, the method would provide a practical enhancement, with no training cost, for GUI agents handling small icons and dense layouts. The training-free nature and use of standard variance formulas on existing model outputs are notable strengths that could generalize across vision-language models.

major comments (2)

[Abstract] Abstract: The headline gains are attributed to the adaptive gate and variance-based sizing, yet the manuscript supplies no quantitative evidence (e.g., gate trigger accuracy, correlation between fused uncertainty and actual localization error, or per-instance error reduction conditioned on the gate decision) that the signals reliably select zoom-in cases rather than wasting compute or introducing new errors on small icons.
[Abstract] Abstract and Experiments: No ablation is reported on the two uncertainty components (spatial consensus vs. token-level confidence) or on the law-of-total-variance crop sizing, so it is impossible to determine whether the reported improvements are causally due to the adaptive mechanism or simply to running a second forward pass on selected examples.
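
The diagnostics requested in the first major comment could be computed along the following lines. This is a sketch only: the function and argument names are hypothetical, Pearson correlation is one possible choice of association measure, and `error_after_zoom` is assumed to be available for every triggered instance.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def gate_diagnostics(uncertainty: Sequence[float], error: Sequence[float],
                     triggered: Sequence[bool],
                     error_after_zoom: Sequence[float]) -> Dict[str, float]:
    """Trigger rate, uncertainty-error correlation, and mean gain when triggered."""
    n = len(triggered)
    fired = [i for i in range(n) if triggered[i]]
    gains = [error[i] - error_after_zoom[i] for i in fired]
    return {
        "trigger_rate": len(fired) / n,
        "uncertainty_error_corr": pearson(uncertainty, error),
        "mean_gain_when_triggered": sum(gains) / len(gains) if gains else 0.0,
    }
```

A gate doing useful work should show a clearly positive correlation and a positive mean gain on the instances where it fires.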

minor comments (1)

The abstract would be clearer if it briefly defined the two uncertainty signals and stated the exact number of stochastic samples used for spatial consensus.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the positive assessment of the significance of our work and for the constructive major comments. We agree that additional quantitative evidence and ablations are needed to better substantiate the claims about the uncertainty mechanisms. We will revise the manuscript accordingly to include these analyses.

Referee: [Abstract] Abstract: The headline gains are attributed to the adaptive gate and variance-based sizing, yet the manuscript supplies no quantitative evidence (e.g., gate trigger accuracy, correlation between fused uncertainty and actual localization error, or per-instance error reduction conditioned on the gate decision) that the signals reliably select zoom-in cases rather than wasting compute or introducing new errors on small icons.

Authors: We acknowledge this limitation in the current version. While the overall benchmark improvements demonstrate the practical utility of UI-Zoomer, we did not provide instance-level diagnostics on the gate's reliability. In the revised manuscript, we will add a dedicated analysis section reporting: (i) the trigger rate and accuracy of the confidence-aware gate (comparing against ground-truth error reduction), (ii) the correlation between the fused uncertainty score and localization error on the test sets, and (iii) per-instance error reduction statistics conditioned on whether zoom-in was triggered. This will provide direct evidence that the uncertainty signals effectively identify cases benefiting from zoom-in.

revision: yes

Referee: [Abstract] Abstract and Experiments: No ablation is reported on the two uncertainty components (spatial consensus vs. token-level confidence) or on the law-of-total-variance crop sizing, so it is impossible to determine whether the reported improvements are causally due to the adaptive mechanism or simply to running a second forward pass on selected examples.

Authors: We agree that isolating the contributions of each component is important. The current experiments focus on end-to-end gains across models and datasets, but lack component-wise ablations. We will add comprehensive ablations in the revised version, including variants that disable spatial consensus, disable token-level confidence, replace variance-based sizing with fixed radii, and compare against non-adaptive baselines such as always performing zoom-in or triggering randomly. These will demonstrate that the gains stem from the adaptive uncertainty-driven decisions rather than merely from additional inference passes.

revision: yes
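
The ablation variants named in this response can be written down as a small configuration grid, sketched below. The `Variant` fields, the variant names, and the matched-rate random trigger are all hypothetical illustrations of the promised comparison, not an experimental setup taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    name: str
    use_consensus: bool    # include the spatial-consensus signal in the gate
    use_token_conf: bool   # include the token-level confidence signal in the gate
    sizing: str            # "variance" or "fixed"
    trigger: str           # "gate", "always", or "random"


def ablation_variants() -> List[Variant]:
    """The full method plus the five ablations listed in the rebuttal."""
    return [
        Variant("full", True, True, "variance", "gate"),
        Variant("no_consensus", False, True, "variance", "gate"),
        Variant("no_token_conf", True, False, "variance", "gate"),
        Variant("fixed_radius", True, True, "fixed", "gate"),
        Variant("always_zoom", False, False, "variance", "always"),
        Variant("random_trigger", False, False, "variance", "random"),
    ]


def decide_trigger(variant: Variant, gate_says_zoom: bool,
                   random_rate: float, rng: random.Random) -> bool:
    """Trigger decision for one instance under a given variant."""
    if variant.trigger == "always":
        return True
    if variant.trigger == "random":
        # Matching random_rate to the gate's trigger rate keeps compute comparable
        return rng.random() < random_rate
    return gate_says_zoom
```

Setting the random baseline's rate equal to the gate's observed trigger rate is what separates "smart selection" from "extra inference passes", which is the question the referee raises.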

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard statistical identities to model outputs.

The paper's central mechanism fuses spatial consensus (variance over stochastic forward passes) with token-level confidence for the gate, then applies the law of total variance to decompose positional spread versus box extent for crop radius. Both steps are direct applications of existing statistical definitions to the model's own sampled outputs rather than fitted parameters or self-referential quantities. No equations reduce claimed improvements to inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Reported gains are empirical on external benchmarks (ScreenSpot-Pro, UI-Vision, ScreenSpot-v2) and therefore falsifiable independently of the method's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that uncertainty signals derived from stochastic forward passes and token probabilities can be fused and decomposed to decide both trigger and scale without introducing bias; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Spatial consensus among stochastic candidates and token-level generation confidence can be fused into a reliable uncertainty gate for localization tasks.
Invoked in the description of the confidence-aware gate.
domain assumption: Prediction variance decomposes via the law of total variance into inter-sample positional spread and intra-sample box extent for determining crop radius.
Directly stated as the basis for the uncertainty-driven crop sizing module.
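
For reference, the identity behind the second axiom can be written out for one coordinate axis. The first line is the standard law of total variance; the second is an illustrative instantiation that assumes N sampled boxes with centers c_i and widths w_i and a uniform distribution inside each box. That uniform assumption and the symbols are ours, not the paper's.

```latex
% S indexes the stochastic sample; X is the target coordinate along one axis.
\[
\operatorname{Var}(X)
  = \underbrace{\operatorname{Var}_S\big(\mathbb{E}[X \mid S]\big)}_{\text{inter-sample positional spread}}
  + \underbrace{\mathbb{E}_S\big[\operatorname{Var}(X \mid S)\big]}_{\text{intra-sample box extent}}
\]

% Assumed instantiation: X | S = i is uniform over box i, so Var(X | S = i) = w_i^2 / 12.
\[
\hat{\sigma}^2
  = \frac{1}{N}\sum_{i=1}^{N} (c_i - \bar{c})^2
  + \frac{1}{N}\sum_{i=1}^{N} \frac{w_i^2}{12},
\qquad
\bar{c} = \frac{1}{N}\sum_{i=1}^{N} c_i
\]
```

How the paper maps this variance to a crop radius is not stated in the abstract, so any scaling from the standard deviation to a radius remains an open detail here.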

pith-pipeline@v0.9.0 · 5524 in / 1399 out tokens · 36613 ms · 2026-05-10T14:02:42.393603+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
cs.CV 2026-05 conditional novelty 7.0

GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.

Reference graph

Works this paper leans on

26 extracted references · 25 canonical work pages · cited by 1 Pith paper · 3 internal anchors
