Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

Amanda Sofie Rios; Amit Ranjan Trivedi; Devashri Naik; Divake Kumar; Nilesh Ahuja; Omesh Tickoo; Ranganath Krishnan; Sina Tayebati

arXiv: 2606.25760 · v1 · pith:P5757S2Fnew · submitted 2026-06-24 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

Pith reviewed 2026-06-25 20:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV

keywords: uncertainty quantification, vision-language models, GUI grounding, computer-use agents, post-hoc methods, benchmark, selective transfer, conformal prediction

The pith

Post-hoc uncertainty estimates for vision-language GUI agents transfer reliably across datasets but not across model classes or interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates 27 post-hoc uncertainty quantification methods on four vision-language models acting as GUI agents across four grounding datasets. It shows that method rankings hold steady when the model is fixed but the dataset changes, yet those rankings shift when the model class or observable interface changes. This pattern of selective transfer means that methods performing well on one agent setup often fail to generalize to others, including from open-weight to closed-source systems. The benchmark also tests conformal prediction for spatial safety regions and finds that calibrated regions shrink substantially but lose reliability under mismatched conditions.

Core claim

Argus, the introduced cross-regime benchmark, demonstrates selective transfer of post-hoc UQ methods for single-step executable GUI grounding: rankings stay stable across datasets for any fixed model, but degrade across model classes and observable interfaces. Hidden-state and density estimators form the most stable open-weight family, while specific sampling, attention, and verbalized methods win only in particular regimes. Within-model transfer reaches Spearman rho values up to 0.969, yet transfer to closed-source vendors averages only +0.08, so closed-source UQ must be reranked on the target rather than extrapolated. Conformal click regions shrink radii by 40-60 percent under matched cali

What carries the argument

The Argus benchmark matrix that applies 27 UQ methods (logit-based, sampling, hidden-state, attention, prompting, and conformal) to four VLM agents and four GUI datasets, plus a closed-source matrix, to measure ranking stability across regimes.
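
Ranking stability of this kind is usually measured by scoring every method in each (model, dataset) cell and then correlating the resulting rankings between cells. The sketch below shows one way to do that with a plain-Python Spearman rho. The method names, cell labels, and AUROC values are hypothetical placeholders rather than numbers from the paper, and using AUROC as the per-cell score is an assumption made for illustration.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant ranking has no defined correlation
    return cov / (vx * vy) ** 0.5


# Hypothetical per-cell scores for five UQ methods (illustrative only).
methods = ["hidden_state", "density", "sampling", "attention", "verbalized"]
cells = {
    ("model_A", "dataset_1"): [0.81, 0.74, 0.69, 0.62, 0.58],
    ("model_A", "dataset_2"): [0.79, 0.75, 0.66, 0.63, 0.55],
    ("model_B", "dataset_1"): [0.60, 0.71, 0.77, 0.58, 0.66],
}

# Pairwise ranking-transfer matrix over all cells.
transfer = {
    (a, b): spearman_rho(cells[a], cells[b]) for a, b in combinations(cells, 2)
}
rho_same_model = transfer[(("model_A", "dataset_1"), ("model_A", "dataset_2"))]
rho_cross_model = transfer[(("model_A", "dataset_1"), ("model_B", "dataset_1"))]
print(rho_same_model, rho_cross_model)
```

In this toy setup the two `model_A` cells order the methods identically, while the `model_B` cell reorders them, which is the qualitative pattern the benchmark describes as selective transfer.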

If this is right

UQ rankings remain stable across datasets when the model is held fixed.
Rankings degrade when switching model classes or observable interfaces.
Hidden-state and density methods are the most stable family among open-weight approaches.
Closed-source UQ requires separate reranking on the target system rather than extrapolation.
Calibrated conformal prediction shrinks click-region radii by 40-60 percent but loses coverage under calibration or interface mismatch.
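
To make the conformal click-region idea concrete, here is a minimal sketch of locally weighted split-conformal calibration for a click disk. It follows the generic textbook recipe, not necessarily the variants the paper defines in its §3. It assumes the click error is the pixel distance between the predicted click and the target, and that the plug-in UQ score is a positive number that grows with expected error. All function names and data values are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of the calibration scores."""
    n = len(scores)
    # The 1e-9 guard protects against floating-point overshoot before ceil.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def calibrate_weighted(cal_errors, cal_uq, alpha=0.10, eps=1e-6):
    """Nonconformity score = click error divided by the plug-in UQ score."""
    scores = [err / max(u, eps) for err, u in zip(cal_errors, cal_uq)]
    return conformal_quantile(scores, alpha)


def disk_radius(q_hat, uq, eps=1e-6):
    """Per-click radius: the calibrated quantile rescaled by this click's UQ."""
    return q_hat * max(uq, eps)


# Hypothetical calibration split: pixel errors and UQ scores for nine clicks.
cal_errors = [2.0, 5.0, 3.0, 30.0, 8.0, 12.0, 6.0, 20.0, 9.0]
cal_uq = [0.25, 0.5, 0.25, 2.0, 0.5, 1.0, 0.5, 1.0, 0.5]

q_weighted = calibrate_weighted(cal_errors, cal_uq, alpha=0.10)
q_fixed = conformal_quantile(cal_errors, alpha=0.10)  # one radius for all clicks

confident_radius = disk_radius(q_weighted, uq=0.25)
uncertain_radius = disk_radius(q_weighted, uq=2.0)
print(q_fixed, confident_radius, uncertain_radius)
```

When calibration and test clicks are exchangeable, disks built this way cover the target with probability at least 1 - alpha. A confident click gets a smaller disk than the fixed-radius baseline only if the UQ score actually tracks error, and the guarantee no longer applies once the calibration and test conditions differ, which is the mismatch failure noted above.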

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Practitioners should evaluate candidate UQ methods directly on their target model and interface rather than adopting published rankings wholesale.
Calibration data collection can focus on model-specific factors instead of collecting new data for every dataset.
Extending conformal calibration to handle interface shifts could preserve the radius-reduction benefit in changing GUI environments.
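
Evaluating candidates directly on the target amounts to scoring each available UQ method on labeled target clicks and sorting. The sketch below does this with AUROC for error detection, computed by direct pairwise comparison. The method names and scores are hypothetical, and AUROC is only one of several metrics that could drive the reranking.

```python
def auroc(uncertainty, is_error):
    """Chance that a wrong click gets higher uncertainty than a correct one.

    Ties count as half a win. The pairwise loop is O(n^2), which is fine
    for a sketch but slow for large evaluation sets.
    """
    wrong = [u for u, e in zip(uncertainty, is_error) if e]
    right = [u for u, e in zip(uncertainty, is_error) if not e]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct click")
    wins = sum((w > r) + 0.5 * (w == r) for w in wrong for r in right)
    return wins / (len(wrong) * len(right))


def rerank_on_target(scores_by_method, is_error):
    """Sort methods by target-side AUROC, best first."""
    table = {m: auroc(s, is_error) for m, s in scores_by_method.items()}
    return sorted(table.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical labeled target clicks: 1 = click missed the target.
is_error = [1, 0, 0, 1, 0, 1]
scores_by_method = {
    "method_x": [0.9, 0.1, 0.2, 0.8, 0.3, 0.7],
    "method_y": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "method_z": [0.1, 0.9, 0.8, 0.2, 0.7, 0.3],
}

ranking = rerank_on_target(scores_by_method, is_error)
for name, score in ranking:
    print(f"{name}: AUROC = {score:.2f}")
```

The same function works for API-only systems as long as the method needs only signals the interface exposes, such as sampled outputs or verbalized confidence.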

Load-bearing premise

The four VLM agents and four datasets used are representative enough of GUI grounding tasks and model behaviors to support general claims about ranking stability and transfer.

What would settle it

Finding that UQ method rankings change substantially when the same methods are applied to a fifth VLM agent or fifth GUI dataset outside the tested set would falsify the selective transfer claim.

Figures

Figures reproduced from arXiv: 2606.25760 by Amanda Sofie Rios, Amit Ranjan Trivedi, Devashri Naik, Divake Kumar, Nilesh Ahuja, Omesh Tickoo, Ranganath Krishnan, Sina Tayebati.

**Figure 1.** Full-method ranking transfer. Spearman ρ between per-cell rankings (50-seed means). Left: 27-method open-weight, 16 cells. Right: 8-method API-only, 12 cells. Diagonal blocks: within-model or within-vendor transfer.

**Figure 2.** Adaptive conformal click-disks at α = 0.10. 50-seed mean radius and coverage gap (target 0.90). Panel (a) covers 12 open-weight cells; panels (b, c) cover 12 API-only cells. Gray band: ±1 pp around target. Variants defined in §3; multi-α and 16-cell results in Appendix A24.

**Figure 3.** Cross-regime benchmark of post-hoc UQ for single-step executable GUI grounding. Each click prediction is paired with 27 open-weight UQ scores (8 on the API-only panel) and evaluated across 4 open-weight agents, 3 frontier closed-source vendors, and 4 grounding datasets, supporting error discrimination, selective execution, calibration, graded miss-severity, ranking transfer, and conformal click-disk covera…

Original abstract

Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet evidence on post-hoc uncertainty quantification (UQ) for these agents is fragmented across isolated model and dataset pairs, leaving it unclear whether UQ rankings stay stable when the agent, benchmark, or observable interface changes. We present Argus, a cross-regime benchmark for post-hoc UQ in single-step executable GUI grounding: a 27-method open-weight matrix over 4 VLM agents and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors where logits, hidden states, and attention maps are unavailable. Evaluated methods span logit-based scores, sampling and consistency measures, hidden-state and density estimators (Mahalanobis, SAPLMA), attention-based scores, P(True) and verbalised-confidence prompting, and split-conformal prediction. The main finding is selective transfer: UQ rankings are stable across datasets for a fixed model, but degrade across model classes and observable interfaces. Hidden-state and density methods are the most stable open-weight family, while CoCoA-1MCA, Focus, sampling-based scores, and verbalised self-assessment win in specific regimes. Within-model ranking transfer is strong (Spearman rho up to 0.969), but cross-tier transfer to closed-source vendors averages only +0.08, so closed-source UQ should be reranked on the target rather than extrapolated. Conformal click regions show score-level discrimination is not enough for deployment: locally weighted disks shrink radii by 40-60% when the plug-in UQ is calibrated, but coverage degrades under calibration-test or interface mismatch. We release per-item records, calibration/test splits, UQ scores, and analysis scripts for regime-aware UQ selection in GUI agents.
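
The rejection use case named in the abstract can be pictured as thresholding the UQ score before a click is executed. The following is a minimal sketch under that assumption. The threshold `tau`, the function name, and the data are hypothetical, and the paper's own selective-execution protocol may differ.

```python
def selective_execution(uncertainty, correct, tau):
    """Execute only clicks with uncertainty <= tau.

    Returns (coverage, accuracy): the fraction of clicks executed and the
    accuracy among executed clicks (None when nothing is executed).
    """
    kept = [c for u, c in zip(uncertainty, correct) if u <= tau]
    coverage = len(kept) / len(correct)
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


# Hypothetical clicks: uncertainty score and whether the click was correct.
uncertainty = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
correct = [1, 1, 1, 0, 1, 0, 0, 0]

strict = selective_execution(uncertainty, correct, tau=0.35)
permissive = selective_execution(uncertainty, correct, tau=1.0)
print(strict, permissive)
```

Sweeping `tau` traces out a coverage-accuracy curve. A UQ method is useful for rejection when accuracy rises as coverage falls, as it does in this toy data.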

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Argus gives a practical 27x4x4 UQ matrix for GUI agents with data release, but the selective-transfer claim sits on a narrow 4-model/4-dataset slice that may not generalize.

The new thing here is the Argus benchmark itself: a 27-method open-weight grid over four VLMs and four GUI datasets, plus an 8-method closed-source slice, with per-item scores and splits released. That cross-regime matrix and the explicit within-model vs cross-model Spearman numbers are not in the earlier isolated evaluations the abstract cites.

The work is useful for anyone who actually has to pick a UQ method for a deployed computer-use agent. It shows that method rankings transfer well across datasets when the model is held fixed, that hidden-state and density methods are the most stable open-weight family, and that extrapolating rankings to closed-source vendors is close to useless (mean Spearman rho of about +0.08). The conformal region results also make the practical point that score discrimination alone is not enough for spatial safety.

The soft spot is exactly the one the stress-test flags. Four agents and four datasets is a small slice. If those models share similar training data or the datasets share similar screen layouts and task distributions, the high within-model rho (up to 0.969) could be an artifact rather than a general property of post-hoc UQ. The abstract does not report sensitivity checks or justify why this quartet spans the relevant axes of architecture, interface observability, and task variation. That makes the “selective transfer” headline rest on limited evidence.

This is a paper for practitioners and benchmark builders in the GUI-agent space who need concrete ranking data and code. It is not yet a definitive guide to UQ transfer. It deserves peer review because the empirical scope and data release are solid enough to be worth referee time, even if the generalization claim needs more datasets or models to hold up.

Referee Report

1 major / 1 minor

Summary. The paper introduces Argus, a benchmark evaluating 27 post-hoc UQ methods across 4 open-weight VLMs and 4 GUI grounding datasets (plus 8 methods across 3 closed-source vendors). It reports that UQ rankings exhibit strong within-model stability across datasets (Spearman rho up to 0.969) but degrade across model classes and interfaces (cross-tier mean rho of about +0.08), with hidden-state and density methods most stable in open-weight settings and other methods winning only in particular regimes. It further shows that conformal click regions shrink substantially when the plug-in UQ is calibrated but lose coverage under calibration-test or interface mismatch, and it releases per-item scores, splits, and scripts.

Significance. If the selective-transfer observation holds beyond the evaluated matrix, the work supplies actionable guidance for regime-aware UQ selection in GUI agents and demonstrates that post-hoc methods cannot be assumed transferable without re-ranking. The open release of per-item records, calibration splits, and analysis scripts is a concrete strength that enables reproducibility and follow-on regime-specific studies.

major comments (1)

[Experimental setup / model and dataset selection] The central selective-transfer claim (stable within-model, degrades across classes) rests on a 4 VLM × 4 dataset open-weight grid. No sensitivity analysis or explicit justification is given for why these particular agents and datasets span the relevant axes of architecture family, training regime, or interface observability; if the quartet shares correlated training data or screen-layout statistics, the observed within-model rho values could be an artifact rather than a general property.

minor comments (1)

[Abstract] The abstract states '27-method open-weight matrix' and '8-method closed-source matrix'; the exact mapping of which methods are evaluated in each regime should be tabulated early for clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment point-by-point below and agree that additional justification for model and dataset selection will strengthen the presentation of the selective-transfer claim.

Referee: [Experimental setup / model and dataset selection] The central selective-transfer claim (stable within-model, degrades across classes) rests on a 4 VLM × 4 dataset open-weight grid. No sensitivity analysis or explicit justification is given for why these particular agents and datasets span the relevant axes of architecture family, training regime, or interface observability; if the quartet shares correlated training data or screen-layout statistics, the observed within-model rho values could be an artifact rather than a general property.

Authors: We acknowledge that the manuscript would benefit from an explicit justification subsection. The four open-weight VLMs were selected to cover distinct architecture families (different vision encoders and LLM backbones), scales, and GUI-relevant fine-tuning regimes, while the four datasets were chosen as standard GUI grounding benchmarks differing in screen complexity, action distribution, and interface observability. The observed within-model Spearman stability (up to 0.969) holds across this diversity, and the degradation to cross-class and closed-source settings (average rho ~0.08) is itself evidence against simple artifact. Nevertheless, we agree a sensitivity analysis or expanded justification would address the concern directly. In the revised version we will add a dedicated paragraph detailing selection criteria with references to the models' training data and architectural differences, plus a brief discussion of why the current grid already spans the key axes. Full sensitivity analysis across additional models is left for future work due to compute limits but is enabled by our released code and per-item scores.

Revision: partial

Circularity Check

0 steps flagged

Purely empirical benchmark; no derivation chain or self-referential steps.

The manuscript is a cross-model, cross-dataset empirical evaluation of post-hoc UQ methods on GUI grounding tasks. All reported quantities (Spearman correlations, coverage rates, radius reductions) are direct measurements on held-out splits; no equations define a target quantity in terms of itself, no fitted parameters are relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. The 4×4 open-weight matrix and closed-source slice are chosen inputs, not outputs of any internal derivation. Hence the central claims do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the chosen models and datasets plus standard assumptions in post-hoc UQ evaluation.

axioms (1)

Domain assumption: The selected 4 VLM agents and 4 datasets are representative of GUI grounding scenarios.
Invoked to generalize the selective transfer result beyond the specific matrix.

pith-pipeline@v0.9.1-grok · 5924 in / 1092 out tokens · 21749 ms · 2026-06-25T20:58:08.168705+00:00 · methodology


[1] [1]

Angelopoulos and Stephen Bates

Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification.arXiv preprint arXiv:2107.07511,

Pith/arXiv arXiv

[2] [2]

Angelopoulos, Stephen Bates, Emmanuel J

Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control.arXiv preprint arXiv:2110.01052,

arXiv

[3] [3]

The art of saying “maybe”: A conformal lens for uncertainty benchmarking in VLMs

Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, and Md Rizwan Parvez. The art of saying “maybe”: A conformal lens for uncertainty benchmarking in VLMs. arXiv preprint arXiv:2509.13379,

arXiv

[4] [4]

Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923,

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.ar...

Pith/arXiv arXiv

[5] [5]

V2P: Visual attention calibration for GUI grounding via background suppression and center peaking.arXiv preprint arXiv:2508.13634,

Jikai Chen, Long Chen, Dong Wang, Qinglin Su, Zhixuan Chu, Bingguang Hao, Leilei Gan, Chenyi Zhuang, and Jinjie Gu. V2P: Visual attention calibration for GUI grounding via background suppression and center peaking.arXiv preprint arXiv:2508.13634,

arXiv

[6] [6]

LM-Polygraph: Uncertainty estimation for language models

Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, and Artem Shelmanov. LM-Polygraph: Uncertainty estimation for language models. arXiv preprint arXiv:2311.07383,

arXiv

[7] [7]

Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221,

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

Pith/arXiv arXiv

[8] [8]

Semantic entropy probes: Robust and cheap hallucination detection in LLMs.arXiv preprint arXiv:2406.15927,

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs.arXiv preprint arXiv:2406.15927,

Pith/arXiv arXiv

[9] [9]

Uncertainty-aware evaluation for vision-language models.arXiv preprint arXiv:2402.14418,

Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, and Eugene Ilyushin. Uncertainty-aware evaluation for vision-language models.arXiv preprint arXiv:2402.14418,

arXiv

[10] [10]

Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation.arXiv preprint arXiv:2511.12389, 2025a

Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja, and Amit Ranjan Trivedi. Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation.arXiv preprint arXiv:2511.12389, 2025a. Divake Kumar, Sina Tayebati, Nastaran Darabi, Vita Pi-Ho Hu, and Amit Ranjan Trivedi. Uncertainty- aware LiD...

work page doi:10.1109/coins65080.2025.11125785 2025

[11] [11]

WebSuite: Systematically evaluating why web agents fail.arXiv preprint arXiv:2406.01623,

Eric Li and Jim Waldo. WebSuite: Systematically evaluating why web agents fail.arXiv preprint arXiv:2406.01623,

arXiv

[12] [12]

ScreenSpot-Pro: GUI grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025a. Yinghao Li, Rushi Qiang, Lama Moukheiber, and Chao Zhang. Language model uncertainty quantification with attention chain.arXiv preprin...

arXiv

[13] [13]

Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, and Sai Rajeswar. UI-Vision: A desktop-centric GUI benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661,

arXiv

[14] [14]

Uncertainty- guided inference-time depth adaptation for transformer-based visual tracking.arXiv preprint arXiv:2602.16160,

Patrick Poggi, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Uncertainty- guided inference-time depth adaptation for transformer-based visual tracking.arXiv preprint arXiv:2602.16160,

arXiv

[15] [15]

UI-TARS: Pioneering automated GUI interaction with native agents.arXiv preprint arXiv:2501.12326,

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

Pith/arXiv arXiv

[16] [16]

UI-Zoomer: Uncertainty-driven adaptive zoom-in for GUI grounding.arXiv preprint arXiv:2604.14113,

Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. UI-Zoomer: Uncertainty-driven adaptive zoom-in for GUI grounding.arXiv preprint arXiv:2604.14113,

Pith/arXiv arXiv

[17]

Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, and Amit Ranjan Trivedi. Learning conformal abstention policies for adaptive risk management in large language and vision-language models. arXiv preprint arXiv:2502.06884, 2025a.

Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Theja Tulabandhula, Rang...

[18]

Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Daniil Vasilev, Akim Tsvigun, Sergey Petrakov, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, and Artem Shelmanov. Benchmarking uncertainty quantification methods for large language models with LM-Polygraph. Transactions of the Association for Computational Linguistics, 13:220–248, 2025a.

[19]

Qingni Wang, Yue Fan, and Xin Eric Wang. SafeGround: Know when to trust GUI grounding models via uncertainty calibration. arXiv preprint arXiv:2602.02419,

[20]

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, et al. OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123,

[21]

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218,

[22]

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227,

[23]

Ruiyang Zhang, Hu Zhang, and Zhedong Zheng. VL-Uncertainty: Detecting hallucination in large vision-language model via uncertainty estimation. arXiv preprint arXiv:2411.11919,

[24]

Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, and Jian Luan. HyperClick: Advancing reliable GUI grounding via uncertainty calibration. arXiv preprint arXiv:2510.27266,

[25]

Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, and Luoyi Fu. Enhancing uncertainty-based hallucination detection with stronger focus. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

[26]

Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, and Jie Zhou. POINTS-GUI-G: GUI-grounding journey. arXiv preprint arXiv:2602.06391,

[27]

A1 Benchmark overview figure

[Figure: ARGUS-Leaderboard, a Grounding & Reasoning benchmark for Universal GUI agents. Recoverable panel labels: Task (Input · GUI screenshot; Goal / Trace); Analysis (Sub-skill Correlation; Selectivity by Family; UQ Leaderboard); Commodity GUI (standard apps, web pages, dialogs); Professional (CAD · IDE · Office · Scientific); Desktop / OS (shell · file mgr · terminal); Reasoning-heavy (functi...)]

[28]

wins-but-miscalibrates

Per-cell numbers are released alongside the data artefacts.

A14 Robustness to relaxed correctness convention

The headline tables use strict point-in-bbox correctness, where a click is correct iff the predicted coordinate lies inside the target bounding box. Real GUIs typically register clicks within a small tolerance of the visible UI element due to OS-le...
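The strict point-in-bbox rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `click_is_correct`, the `(x_min, y_min, x_max, y_max)` box layout, the inclusive edges, and the modelling of the relaxed convention as a uniform pixel margin around the box are all assumptions, since the excerpt does not show how the paper defines its tolerance.

```python
from typing import Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def click_is_correct(x: float, y: float, bbox: BBox, tolerance: float = 0.0) -> bool:
    """Return True if the click (x, y) counts as a hit on bbox.

    tolerance=0 gives strict point-in-bbox correctness (edges inclusive,
    which is an assumption). A positive tolerance grows the box by that
    many pixels on every side, one possible reading of a relaxed rule.
    """
    x_min, y_min, x_max, y_max = bbox
    inside_x = x_min - tolerance <= x <= x_max + tolerance
    inside_y = y_min - tolerance <= y <= y_max + tolerance
    return inside_x and inside_y


if __name__ == "__main__":
    target = (0.0, 0.0, 20.0, 20.0)
    print(click_is_correct(10, 10, target))               # inside the box
    print(click_is_correct(23, 10, target))               # 3 px right of the box
    print(click_is_correct(23, 10, target, tolerance=5))  # within a 5 px margin
```

Under this reading, rerunning an evaluation with a small positive `tolerance` shows how many strict misses are near-misses that a real GUI might still register.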


[29]

Up” direction worked uniformly; “down

on every open-weight cell. AUROC values are 50-seed means, consistent with the rest of the paper. The relative-Mahalanobis lift over the naive form is positive on every cell.

Cell      Mahal-naive AUROC   Mahal-RMD AUROC   Lift
Q7×V2     .506                .781              +0.276
Q7×SP     .386                .859              +0.473
Q7×OSG    .476                .762              +0.286
Q7×UIV    .435                .792              +0.357
Q72×V2    .463                .681              +0.218
Q72×SP    .464                .861              +0....
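To make the naive versus relative Mahalanobis comparison concrete, here is a minimal NumPy sketch. It follows the commonly used definition of relative Mahalanobis distance (distance to a fitted reference Gaussian minus distance to a single background Gaussian fitted on all features). The excerpt does not show the paper's exact construction, so the choice of reference and background sets, the ridge term, and every function name here are assumptions made for illustration.

```python
import numpy as np


def fit_gaussian(features: np.ndarray, ridge: float = 1e-6):
    """Fit a mean and precision matrix to an (n, d) feature array.

    The small ridge term keeps the covariance invertible (an assumption,
    not something stated in the source).
    """
    mean = features.mean(axis=0)
    centered = features - mean
    cov = centered.T @ centered / len(features)
    cov += ridge * np.eye(features.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_sq(x: np.ndarray, mean: np.ndarray, precision: np.ndarray):
    """Squared Mahalanobis distance for a single vector or a batch of rows."""
    diff = x - mean
    return np.einsum("...i,ij,...j->...", diff, precision, diff)


def naive_score(x, reference):
    """Naive form: distance to the reference Gaussian only."""
    return mahalanobis_sq(x, *reference)


def relative_score(x, reference, background):
    """Relative form: reference distance minus background distance."""
    return mahalanobis_sq(x, *reference) - mahalanobis_sq(x, *background)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref_feats = rng.normal(size=(200, 3))        # hypothetical reference features
    all_feats = rng.normal(size=(400, 3)) * 3.0  # hypothetical background features
    reference = fit_gaussian(ref_feats)
    background = fit_gaussian(all_feats)
    print(relative_score(ref_feats[:5], reference, background))
```

Subtracting the background term removes the part of the distance that every input shares, which is one plausible explanation for the lift over the naive form reported in the table.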
