UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
Pith reviewed 2026-05-10 14:02 UTC · model grok-4.3
The pith
Uncertainty signals decide when and how much to zoom into GUI screenshots for better element localization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UI-Zoomer is a training-free framework that treats the trigger and the scale of zoom-in as an uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic forward passes with token-level generation confidence to decide whether localization is uncertain enough to warrant zooming. When triggered, an uncertainty-driven crop sizing module decomposes total prediction variance into inter-sample positional spread and intra-sample box extent, then derives a per-instance crop radius from the law of total variance. Experiments show this selective approach yields consistent gains over strong baselines on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 across several
What carries the argument
The confidence-aware gate that fuses spatial consensus among stochastic candidates with token-level generation confidence, paired with the uncertainty-driven crop sizing module that applies the law of total variance to derive per-instance crop radii.
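One way to picture the gate is the minimal sketch below. It is not the paper's implementation: the use of mean pairwise IoU as "spatial consensus", the geometric-mean token probability as "generation confidence", the linear fusion, and the `weight` and `threshold` values are all assumptions made for illustration.

```python
from itertools import combinations
from math import exp

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def spatial_consensus(boxes: list[Box]) -> float:
    """Mean pairwise IoU among sampled boxes (assumed consensus measure)."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 1.0  # a single sample trivially agrees with itself
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def token_confidence(logprobs: list[float]) -> float:
    """Geometric mean of token probabilities for one generated answer."""
    return exp(sum(logprobs) / len(logprobs))


def should_zoom(boxes: list[Box], logprobs_per_sample: list[list[float]],
                weight: float = 0.5, threshold: float = 0.6) -> bool:
    """Trigger zoom-in when the fused confidence falls below a threshold.

    `weight` and `threshold` are hypothetical knobs, not values from the paper.
    """
    consensus = spatial_consensus(boxes)
    confidence = sum(token_confidence(lp) for lp in logprobs_per_sample) / len(logprobs_per_sample)
    fused = weight * consensus + (1.0 - weight) * confidence
    return fused < threshold
```

With identical sampled boxes and near-certain tokens the fused score is close to 1 and the gate stays closed; scattered boxes or low token probabilities pull it down and open the gate.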
If this is right
- Consistent accuracy gains across ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 benchmarks.
- Improvements hold for multiple underlying model architectures without retraining.
- Zoom-in is applied only to uncertain instances, limiting extra compute to cases that need it.
- Crop size is derived directly from observed prediction variance rather than fixed heuristics.
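The variance-driven crop sizing in the last point can be sketched as follows. This is an illustrative reading, not the authors' code: treating each box as a uniform distribution (so its per-axis variance is width squared over 12), summing the x and y variances, and the `scale` and `min_radius` parameters are all assumptions.

```python
from math import sqrt
from statistics import fmean, pvariance

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def crop_radius(boxes: list[Box], scale: float = 3.0, min_radius: float = 32.0) -> float:
    """Derive a crop radius from the total variance of sampled boxes.

    total variance = inter-sample spread of box centers
                   + mean intra-sample variance implied by each box's extent.
    `scale` and `min_radius` are hypothetical parameters.
    """
    centers_x = [(x1 + x2) / 2 for x1, _, x2, _ in boxes]
    centers_y = [(y1 + y2) / 2 for _, y1, _, y2 in boxes]

    # Variance of the conditional means: how far the sampled centers scatter.
    inter = pvariance(centers_x) + pvariance(centers_y)

    # Mean of the conditional variances: a uniform box of width w has variance w^2 / 12.
    intra = fmean(((x2 - x1) ** 2 + (y2 - y1) ** 2) / 12 for x1, y1, x2, y2 in boxes)

    return max(min_radius, scale * sqrt(inter + intra))
```

Under these assumptions, tightly clustered small boxes yield a small crop (bounded below by `min_radius`), while scattered or large boxes widen it.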
Where Pith is reading between the lines
- The same uncertainty signals could be reused to adapt other test-time decisions such as choosing between multiple candidate prompts or resolutions in vision-language tasks.
- If the variance decomposition generalizes, similar uncertainty-driven cropping might improve localization in non-GUI domains such as medical imaging or autonomous driving scenes.
- The approach opens a path toward fully training-free pipelines that chain several cheap uncertainty checks before committing to expensive high-resolution inference.
Load-bearing premise
The fused spatial-consensus and token-level confidence signals reliably identify cases where zoom-in will actually reduce localization error rather than introduce new errors or waste compute.
What would settle it
A collection of test cases in which the method triggers zoom-in but the re-inference at higher resolution produces strictly larger localization error than the original low-resolution prediction, or in which skipping zoom produces better accuracy than zooming.
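A check of this kind could be tallied with a few lines of code. The sketch below is hypothetical: the `Case` record, its field names, and the choice of error metric (for example, pixel distance from the predicted point to the ground-truth element) are assumptions, not artifacts from the paper.

```python
from dataclasses import dataclass


@dataclass
class Case:
    zoomed: bool        # did the gate trigger zoom-in for this instance?
    err_before: float   # localization error of the first, low-resolution prediction
    err_after: float    # localization error after re-inference on the crop


def harmful_zoom_rate(cases: list[Case]) -> float:
    """Fraction of triggered cases where zooming made the error strictly worse."""
    triggered = [c for c in cases if c.zoomed]
    if not triggered:
        return 0.0
    harmful = sum(c.err_after > c.err_before for c in triggered)
    return harmful / len(triggered)
```

A rate well above zero on held-out data would be evidence against the load-bearing premise; a rate near zero would support it.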
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding that treats both the trigger decision and crop scale as uncertainty quantification problems. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to trigger zoom-in selectively on uncertain cases; when triggered, crop radius is derived by decomposing prediction variance into inter-sample positional spread and intra-sample box extent via the law of total variance. Experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 report consistent gains (up to +13.4%, +10.3%, +4.2%) over strong baselines across multiple model architectures without additional training.
Significance. If the uncertainty signals reliably identify cases where zoom-in reduces localization error, the method would provide a practical enhancement, with no training cost, for GUI agents handling small icons and dense layouts. The training-free nature and use of standard variance formulas on existing model outputs are notable strengths that could generalize across vision-language models.
major comments (2)
- [Abstract] Abstract: The headline gains are attributed to the adaptive gate and variance-based sizing, yet the manuscript supplies no quantitative evidence (e.g., gate trigger accuracy, correlation between fused uncertainty and actual localization error, or per-instance error reduction conditioned on the gate decision) that the signals reliably select zoom-in cases rather than wasting compute or introducing new errors on small icons.
- [Abstract] Abstract and Experiments: No ablation is reported on the two uncertainty components (spatial consensus vs. token-level confidence) or on the law-of-total-variance crop sizing, so it is impossible to determine whether the reported improvements are causally due to the adaptive mechanism or simply to running a second forward pass on selected examples.
minor comments (1)
- The abstract would be clearer if it briefly defined the two uncertainty signals and stated the exact number of stochastic samples used for spatial consensus.
Simulated Author's Rebuttal
We sincerely thank the referee for the positive assessment of the significance of our work and for the constructive major comments. We agree that additional quantitative evidence and ablations are needed to better substantiate the claims about the uncertainty mechanisms. We will revise the manuscript accordingly to include these analyses.
- Referee: [Abstract] Abstract: The headline gains are attributed to the adaptive gate and variance-based sizing, yet the manuscript supplies no quantitative evidence (e.g., gate trigger accuracy, correlation between fused uncertainty and actual localization error, or per-instance error reduction conditioned on the gate decision) that the signals reliably select zoom-in cases rather than wasting compute or introducing new errors on small icons.
  Authors: We acknowledge this limitation in the current version. While the overall benchmark improvements demonstrate the practical utility of UI-Zoomer, we did not provide instance-level diagnostics on the gate's reliability. In the revised manuscript, we will add a dedicated analysis section reporting: (i) the trigger rate and accuracy of the confidence-aware gate (comparing against ground-truth error reduction), (ii) the correlation between the fused uncertainty score and localization error on the test sets, and (iii) per-instance error reduction statistics conditioned on whether zoom-in was triggered. This will provide direct evidence that the uncertainty signals effectively identify cases benefiting from zoom-in.
  revision: yes
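The diagnostics promised in this response could be computed along the following lines. This is a hedged sketch only: the function names, the use of Pearson correlation, and the boolean `would_benefit` labels (whether zooming actually reduces error on an instance) are assumptions, since the source does not specify how these statistics will be measured.

```python
from math import sqrt
from statistics import fmean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between uncertainty scores and localization errors."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # correlation is undefined for constant input; report 0 by convention
    return cov / (norm_x * norm_y)


def gate_accuracy(triggered: list[bool], would_benefit: list[bool]) -> float:
    """Share of instances where the gate decision matches the ground-truth benefit label."""
    agree = sum(t == b for t, b in zip(triggered, would_benefit))
    return agree / len(triggered)
```

A strongly positive correlation and a gate accuracy well above the base rate would address the referee's first comment; a rank correlation could be substituted if the error distribution is heavy-tailed.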
- Referee: [Abstract] Abstract and Experiments: No ablation is reported on the two uncertainty components (spatial consensus vs. token-level confidence) or on the law-of-total-variance crop sizing, so it is impossible to determine whether the reported improvements are causally due to the adaptive mechanism or simply to running a second forward pass on selected examples.
  Authors: We agree that isolating the contributions of each component is important. The current experiments focus on end-to-end gains across models and datasets, but lack component-wise ablations. We will add comprehensive ablations in the revised version, including variants that disable spatial consensus, disable token-level confidence, replace variance-based sizing with fixed radii, and compare against non-adaptive baselines such as always performing zoom-in or triggering randomly. These will demonstrate that the gains stem from the adaptive uncertainty-driven decisions rather than merely from additional inference passes.
  revision: yes
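The non-adaptive baselines named in this response can be expressed as interchangeable gate policies. The sketch below is illustrative: every function name is hypothetical, and it assumes each instance carries a fused confidence score where lower means less certain.

```python
import random
from typing import Callable

Gate = Callable[[float], bool]  # maps a fused confidence score to a zoom decision


def always_zoom(score: float) -> bool:
    """Baseline: zoom on every instance, ignoring the score."""
    return True


def make_random_gate(rate: float, seed: int = 0) -> Gate:
    """Baseline: zoom at a fixed rate, independent of the score."""
    rng = random.Random(seed)
    return lambda score: rng.random() < rate


def make_adaptive_gate(threshold: float) -> Gate:
    """Adaptive policy: zoom only when fused confidence is below the threshold."""
    return lambda score: score < threshold


def trigger_rate(gate: Gate, scores: list[float]) -> float:
    """Fraction of instances on which a gate triggers zoom-in."""
    return sum(gate(s) for s in scores) / len(scores)
```

Setting the random gate's `rate` to the adaptive gate's measured trigger rate gives a compute-matched control, which is what separates "the gate picks the right cases" from "a second pass helps on any cases".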
Circularity Check
No significant circularity; derivation applies standard statistical identities to model outputs.
The paper's central mechanism fuses spatial consensus (variance over stochastic forward passes) with token-level confidence for the gate, then applies the law of total variance to decompose positional spread versus box extent for crop radius. Both steps are direct applications of existing statistical definitions to the model's own sampled outputs rather than fitted parameters or self-referential quantities. No equations reduce claimed improvements to inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Reported gains are empirical on external benchmarks (ScreenSpot-Pro, UI-Vision, ScreenSpot-v2) and therefore falsifiable independently of the method's internal definitions.
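For reference, the law of total variance that this rationale leans on can be written out as below. The identity itself is standard; the mapping of its two terms onto box extent and positional spread is an assumed reading of the abstract, and the symbols $X$ (target position) and $S$ (index of the stochastic sample) are hypothetical notation, not the paper's own.

```latex
\operatorname{Var}(X)
  = \underbrace{\mathbb{E}\big[\operatorname{Var}(X \mid S)\big]}_{\text{intra-sample box extent}}
  + \underbrace{\operatorname{Var}\big(\mathbb{E}[X \mid S]\big)}_{\text{inter-sample positional spread}}
```

Read this way, each sampled box contributes a conditional mean (its center) and a conditional variance (set by its width and height), and the crop radius would be some function of the resulting total variance.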
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Spatial consensus among stochastic candidates and token-level generation confidence can be fused into a reliable uncertainty gate for localization tasks.
- domain assumption: Prediction variance decomposes via the law of total variance into inter-sample positional spread and intra-sample box extent for determining crop radius.
Forward citations
Cited by 1 Pith paper
- What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
  GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.