Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets
Pith reviewed 2026-06-25 20:58 UTC · model grok-4.3
The pith
Post-hoc uncertainty estimates for vision-language GUI agents transfer reliably across datasets but not across model classes or interfaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Argus, the introduced cross-regime benchmark, demonstrates selective transfer of post-hoc UQ methods for single-step executable GUI grounding: rankings stay stable across datasets for any fixed model, but degrade across model classes and observable interfaces. Hidden-state and density estimators form the most stable open-weight family, while specific sampling, attention, and verbalized methods win only in particular regimes. Within-model transfer reaches Spearman rho values up to 0.969, yet transfer to closed-source vendors averages only +0.08, so closed-source UQ must be reranked on the target rather than extrapolated. Conformal click regions shrink radii by 40-60 percent under matched cali
What carries the argument
The Argus benchmark matrix that applies 27 UQ methods (logit-based, sampling, hidden-state, attention, prompting, and conformal) to four VLM agents and four GUI datasets, plus a closed-source matrix, to measure ranking stability across regimes.
If this is right
- UQ rankings remain stable across datasets when the model is held fixed.
- Rankings degrade when switching model classes or observable interfaces.
- Hidden-state and density methods are the most stable family among open-weight approaches.
- Closed-source UQ requires separate reranking on the target system rather than extrapolation.
- Calibrated conformal prediction shrinks click-region radii by 40-60 percent but loses coverage under calibration or interface mismatch.
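The ranking-stability claims above are stated in terms of Spearman rank correlation between UQ method rankings in two regimes. The following is a minimal sketch of that computation, assuming each method is summarized by one score per regime (for example an AUROC). The method names and numbers are hypothetical placeholders, not values from the paper, and `average_ranks` and `spearman_rho` are illustrative helper names.

```python
from statistics import mean

def average_ranks(values):
    """Rank values 1..n (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-method scores (NOT from the paper), one list per regime.
methods = ["mahalanobis", "saplma", "focus", "p_true", "verbalised"]
source_regime = [0.80, 0.78, 0.70, 0.66, 0.60]
same_model_new_dataset = [0.79, 0.80, 0.69, 0.64, 0.61]
other_model_class = [0.62, 0.70, 0.58, 0.74, 0.66]

rho_within = spearman_rho(source_regime, same_model_new_dataset)
rho_across = spearman_rho(source_regime, other_model_class)
```

A high `rho_within` with a low `rho_across` is the pattern the review calls selective transfer; a rho near zero on the target regime is the signal to rerank there rather than reuse a published ordering.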
Where Pith is reading between the lines
- Practitioners should evaluate candidate UQ methods directly on their target model and interface rather than adopting published rankings wholesale.
- Calibration data collection can focus on model-specific factors instead of collecting new data for every dataset.
- Extending conformal calibration to handle interface shifts could preserve the radius-reduction benefit in changing GUI environments.
Load-bearing premise
The four VLM agents and four datasets used are representative enough of GUI grounding tasks and model behaviors to support general claims about ranking stability and transfer.
What would settle it
Finding that UQ method rankings change substantially when the same methods are applied to a fifth VLM agent or fifth GUI dataset outside the tested set would falsify the selective transfer claim.
Original abstract
Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet evidence on post-hoc uncertainty quantification (UQ) for these agents is fragmented across isolated model and dataset pairs, leaving it unclear whether UQ rankings stay stable when the agent, benchmark, or observable interface changes. We present Argus, a cross-regime benchmark for post-hoc UQ in single-step executable GUI grounding: a 27-method open-weight matrix over 4 VLM agents and 4 datasets, plus an 8-method closed-source matrix across 3 frontier vendors where logits, hidden states, and attention maps are unavailable. Evaluated methods span logit-based scores, sampling and consistency measures, hidden-state and density estimators (Mahalanobis, SAPLMA), attention-based scores, P(True) and verbalised-confidence prompting, and split-conformal prediction. The main finding is selective transfer: UQ rankings are stable across datasets for a fixed model, but degrade across model classes and observable interfaces. Hidden-state and density methods are the most stable open-weight family, while CoCoA-1MCA, Focus, sampling-based scores, and verbalised self-assessment win in specific regimes. Within-model ranking transfer is strong (Spearman rho up to 0.969), but cross-tier transfer to closed-source vendors averages only +0.08, so closed-source UQ should be reranked on the target rather than extrapolated. Conformal click regions show score-level discrimination is not enough for deployment: locally weighted disks shrink radii by 40-60% when the plug-in UQ is calibrated, but coverage degrades under calibration-test or interface mismatch. We release per-item records, calibration/test splits, UQ scores, and analysis scripts for regime-aware UQ selection in GUI agents.
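The abstract's conformal click regions can be illustrated with a generic locally weighted split-conformal construction. This is a sketch under stated assumptions, not the paper's procedure: the nonconformity score is assumed to be the click-to-target distance divided by a positive plug-in UQ score, and the names `conformal_quantile`, `calibrate`, `click_radius`, and `empirical_coverage` are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def calibrate(cal_items, alpha):
    """cal_items: (pred_xy, target_xy, uq_score) triples, with uq_score > 0.

    Assumed score: Euclidean miss distance scaled by the plug-in UQ score.
    """
    scores = [math.dist(pred, target) / uq for pred, target, uq in cal_items]
    return conformal_quantile(scores, alpha)

def click_radius(q_hat, uq_score):
    """Disk radius around the predicted click; shrinks when the UQ score is small."""
    return q_hat * uq_score

def empirical_coverage(test_items, q_hat):
    """Fraction of test targets that fall inside their predicted disk."""
    hits = [math.dist(p, t) <= click_radius(q_hat, u) for p, t, u in test_items]
    return sum(hits) / len(hits)

# Toy calibration set (hypothetical): misses of 1..7 pixels, unit UQ score.
cal = [((0.0, 0.0), (float(d), 0.0), 1.0) for d in range(1, 8)]
q_hat = calibrate(cal, alpha=0.25)
```

Under exchangeability between calibration and test items, disks built this way cover the target with probability at least `1 - alpha`. That guarantee no longer applies under calibration-test or interface mismatch, which is the regime where the abstract reports degraded coverage.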
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Argus, a benchmark evaluating 27 post-hoc UQ methods across 4 open-weight VLMs and 4 GUI grounding datasets (plus 8 methods on 3 closed-source models). It reports that UQ rankings exhibit strong within-model stability across datasets (Spearman rho up to 0.969) but degrade across model classes and interfaces (cross-tier rho ~0.08), with hidden-state/density methods most stable in open-weight settings and specific methods winning in others. It further shows that conformal click regions improve spatial safety when calibrated but suffer under mismatch, and releases per-item scores, splits, and scripts.
Significance. If the selective-transfer observation holds beyond the evaluated matrix, the work supplies actionable guidance for regime-aware UQ selection in GUI agents and demonstrates that post-hoc methods cannot be assumed transferable without re-ranking. The open release of per-item records, calibration splits, and analysis scripts is a concrete strength that enables reproducibility and follow-on regime-specific studies.
major comments (1)
- [Experimental setup / model and dataset selection] The central selective-transfer claim (stable within-model, degrades across classes) rests on a 4 VLM × 4 dataset open-weight grid. No sensitivity analysis or explicit justification is given for why these particular agents and datasets span the relevant axes of architecture family, training regime, or interface observability; if the quartet shares correlated training data or screen-layout statistics, the observed within-model rho values could be an artifact rather than a general property.
minor comments (1)
- [Abstract] The abstract states '27-method open-weight matrix' and '8-method closed-source matrix'; the exact mapping of which methods are evaluated in each regime should be tabulated early for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment point-by-point below and agree that additional justification for model and dataset selection will strengthen the presentation of the selective-transfer claim.
- Referee: [Experimental setup / model and dataset selection] The central selective-transfer claim (stable within-model, degrades across classes) rests on a 4 VLM × 4 dataset open-weight grid. No sensitivity analysis or explicit justification is given for why these particular agents and datasets span the relevant axes of architecture family, training regime, or interface observability; if the quartet shares correlated training data or screen-layout statistics, the observed within-model rho values could be an artifact rather than a general property.
  Authors: We acknowledge that the manuscript would benefit from an explicit justification subsection. The four open-weight VLMs were selected to cover distinct architecture families (different vision encoders and LLM backbones), scales, and GUI-relevant fine-tuning regimes, while the four datasets were chosen as standard GUI grounding benchmarks differing in screen complexity, action distribution, and interface observability. The observed within-model Spearman stability (up to 0.969) holds across this diversity, and the degradation in cross-class and closed-source settings (average rho ~0.08) is itself evidence against a simple artifact. Nevertheless, we agree that a sensitivity analysis or expanded justification would address the concern directly. In the revised version we will add a dedicated paragraph detailing selection criteria with references to the models' training data and architectural differences, plus a brief discussion of why the current grid already spans the key axes. Full sensitivity analysis across additional models is left for future work due to compute limits but is enabled by our released code and per-item scores.
  Revision: partial
Circularity Check
Purely empirical benchmark; no derivation chain or self-referential steps.
The manuscript is a cross-model, cross-dataset empirical evaluation of post-hoc UQ methods on GUI grounding tasks. All reported quantities (Spearman correlations, coverage rates, radius reductions) are direct measurements on held-out splits; no equations define a target quantity in terms of itself, no fitted parameters are relabeled as predictions, and no uniqueness theorems or ansatzes are imported via self-citation. The 4×4 open-weight matrix and closed-source slice are chosen inputs, not outputs of any internal derivation. Hence the central claims do not reduce to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the selected 4 VLM agents and 4 datasets are representative of GUI grounding scenarios.
Reference graph
Works this paper leans on
- [1] Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511.
- [2] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052.
- [3] Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, and Md Rizwan Parvez. The art of saying “maybe”: A conformal lens for uncertainty benchmarking in VLMs. arXiv preprint arXiv:2509.13379.
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- [5] Jikai Chen, Long Chen, Dong Wang, Qinglin Su, Zhixuan Chu, Bingguang Hao, Leilei Gan, Chenyi Zhuang, and Jinjie Gu. V2P: Visual attention calibration for GUI grounding via background suppression and center peaking. arXiv preprint arXiv:2508.13634.
- [6] Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, and Artem Shelmanov. LM-Polygraph: Uncertainty estimation for language models. arXiv preprint arXiv:2311.07383.
- [7] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, ... Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- [8] Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927.
- [9] Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, and Eugene Ilyushin. Uncertainty-aware evaluation for vision-language models. arXiv preprint arXiv:2402.14418.
- [10] Divake Kumar, Patrick Poggi, Sina Tayebati, Devashri Naik, Nilesh Ahuja, and Amit Ranjan Trivedi. Calibrated decomposition of aleatoric and epistemic uncertainty in deep features for inference-time adaptation. arXiv preprint arXiv:2511.12389, 2025a.
  Divake Kumar, Sina Tayebati, Nastaran Darabi, Vita Pi-Ho Hu, and Amit Ranjan Trivedi. Uncertainty-aware LiD...
- [11] Eric Li and Jim Waldo. WebSuite: Systematically evaluating why web agents fail. arXiv preprint arXiv:2406.01623.
- [12] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025a.
  Yinghao Li, Rushi Qiang, Lama Moukheiber, and Chao Zhang. Language model uncertainty quantification with attention chain. arXiv preprin...
- [13] Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, and Sai Rajeswar. UI-Vision: A desktop-centric GUI benchmark for visual perception and interaction. arXiv preprint arXiv:2503.15661.
- [14] Patrick Poggi, Divake Kumar, Theja Tulabandhula, and Amit Ranjan Trivedi. Uncertainty-guided inference-time depth adaptation for transformer-based visual tracking. arXiv preprint arXiv:2602.16160.
- [15] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya... UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326.
- [16] Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. UI-Zoomer: Uncertainty-driven adaptive zoom-in for GUI grounding. arXiv preprint arXiv:2604.14113.
- [17] Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, and Amit Ranjan Trivedi. Learning conformal abstention policies for adaptive risk management in large language and vision-language models. arXiv preprint arXiv:2502.06884, 2025a.
  Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Theja Tulabandhula, Rang...
- [18] Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev, Lyudmila Rvanova, Daniil Vasilev, Akim Tsvigun, Sergey Petrakov, Rui Xing, Abdelrahman Boda Sadallah, Kirill Grishchenkov, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, Maxim Panov, and Artem Shelmanov. Benchmarking uncertainty quantification methods for large language models with LM-Polygraph. Tr...
- [19] Qingni Wang, Yue Fan, and Xin Eric Wang. SafeGround: Know when to trust GUI grounding models via uncertainty calibration. arXiv preprint arXiv:2602.02419.
- [20] Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, et al. OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123.
- [21] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218.
- [22] Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227.
- [23] Ruiyang Zhang, Hu Zhang, and Zhedong Zheng. VL-Uncertainty: Detecting hallucination in large vision-language model via uncertainty estimation. arXiv preprint arXiv:2411.11919.
- [24] Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, and Jian Luan. HyperClick: Advancing reliable GUI grounding via uncertainty calibration. arXiv preprint arXiv:2510.27266.
- [25] Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, and Luoyi Fu. Enhancing uncertainty-based hallucination detection with stronger focus. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
- [26] Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, and Jie Zhou. POINTS-GUI-G: GUI-grounding journey. arXiv preprint arXiv:2602.06391.
- [27] A1 Benchmark overview figure. [Figure: ARGUS-Leaderboard, “A Grounding & Reasoning benchmark for Universal GUI agents”. Panel labels: Task, Input (GUI screenshot), Goal / Trace, Sub-skill, Analysis, Correlation, Selectivity by Family, UQ Leaderboard. Categories: Commodity GUI (standard apps, web pages, dialogs); Professional (CAD, IDE, Office, Scientific); Desktop / OS (shell, file mgr, terminal); Reasoning-heavy (functi...)]
- [28] wins-but-miscalibrates
  Per-cell numbers are released alongside the data artefacts.
  A14 Robustness to relaxed correctness convention
  The headline tables use strict point-in-bbox correctness, where a click is correct iff the predicted coordinate lies inside the target bounding box. Real GUIs typically register clicks within a small tolerance of the visible UI element due to OS-le...
- [29] Up” direction worked uniformly; “down
  on every open-weight cell. AUROC values are 50-seed means, consistent with the rest of the paper. The relative-Mahalanobis lift over the naive form is positive on every cell.

  Cell      Mahal-naive AUROC   Mahal-RMD AUROC   Lift
  Q7×V2     .506                .781              +0.276
  Q7×SP     .386                .859              +0.473
  Q7×OSG    .476                .762              +0.286
  Q7×UIV    .435                .792              +0.357
  Q72×V2    .463                .681              +0.218
  Q72×SP    .464                .861              +0....
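The strict point-in-bbox convention quoted in excerpt [28], and a relaxed variant of it, can be sketched as a single predicate. This is an illustration only: the excerpt is truncated before it states how the tolerance is defined, so the `tolerance_px` parameter, its pixel units, the inclusive box edges, and the function name `is_correct` are all assumptions.

```python
def is_correct(click, bbox, tolerance_px=0.0):
    """Return True if the click lands inside the target box.

    click: (x, y) predicted coordinate.
    bbox: (x_min, y_min, x_max, y_max) target bounding box.
    tolerance_px: 0 gives strict point-in-bbox; a positive value expands
        the box on every side (hypothetical relaxed convention).
    Box edges are treated as inside (an assumption).
    """
    x, y = click
    x_min, y_min, x_max, y_max = bbox
    return (x_min - tolerance_px <= x <= x_max + tolerance_px
            and y_min - tolerance_px <= y <= y_max + tolerance_px)

# Hypothetical target box and clicks.
box = (0, 0, 10, 10)
strict_hit = is_correct((5, 5), box)
strict_miss = is_correct((12, 5), box)
relaxed_hit = is_correct((12, 5), box, tolerance_px=3)
```

Recomputing correctness labels under both settings and then re-scoring each UQ method is one way to check whether a ranking depends on the correctness convention.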