Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Anan Du; Chongyang Zhang; Jian Luan; Pei Fu; Ruoceng Zhang; Shaojie Zhang; Xiuwen Xi; Yuchen Sun; Zhenbo Luo

arXiv: 2605.14311 · v2 · pith:BBPZHPOCnew · submitted 2026-05-14 · cs.LG · cs.AI · cs.HC

Pith reviewed 2026-05-19 16:41 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.HC

keywords GUI agents · critic model · contrastive learning · affordance · metric learning · test-time scaling · hierarchical evaluation · semantic alignment

The pith

Reframing GUI critique as metric learning in a shared affordance space outperforms binary classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that binary classification for GUI critics causes two problems: affordance collapse, where rich hierarchies of action validity are flattened to 0/1, and noise sensitivity, where models overfit to uncertain boundaries. To fix this, BBCritic uses two-stage contrastive learning to place instructions and actions in a common Affordance Space, learning continuous similarities based on functional equivalence. This recovers the structure needed for accurate ranking of multiple candidate actions during test-time scaling. The authors also release BBBench, a benchmark with dense actions and a four-level hierarchy for evaluation. A 3B model trained this way beats 7B binary models and transfers to new settings without extra labels.

Core claim

GUI critique is fundamentally a metric-learning problem rather than a classification one. BBCritic resolves the defects of binary supervision by aligning instructions and actions through two-stage contrastive learning in a shared Affordance Space, recovering the hierarchical affordance structure that binary labels flatten. This leads to superior fine-grained ranking performance and zero-shot generalization.

What carries the argument

Two-stage contrastive learning in a shared Affordance Space that aligns instructions with actions according to the Functional Equivalence Hypothesis.
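
As a rough illustration of the kind of objective this implies, the sketch below computes a generic InfoNCE loss over paired instruction and action embeddings. This page does not give the paper's actual loss, its encoders, or how its two stages differ, so the function names, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(instr_embs, action_embs, temperature=0.1):
    """Mean InfoNCE loss where instr_embs[i] is paired with action_embs[i].

    Every other action in the batch acts as a negative for instruction i.
    The temperature is an illustrative choice, not a value from the paper.
    """
    losses = []
    for i, query in enumerate(instr_embs):
        logits = [cosine(query, act) / temperature for act in action_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy embeddings: two instructions and their matching actions.
instructions = [[1.0, 0.0], [0.0, 1.0]]
aligned_actions = [[1.0, 0.0], [0.0, 1.0]]
swapped_actions = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(instructions, aligned_actions)
loss_swapped = info_nce(instructions, swapped_actions)
```

The loss is low when each instruction sits closest to its own action and high when the pairing is swapped. The same similarity scores double as a continuous ranking signal at inference time, which is what a thresholded 0/1 label cannot provide.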

Load-bearing premise

The Functional Equivalence Hypothesis is true, meaning that contrastive learning without extra annotations can recover the full hierarchical structure of affordances that binary labels lose.

What would settle it

The claim would be undermined if experiments on BBBench showed that BBCritic does not achieve higher ranking accuracy than binary critic models when distinguishing the four levels of the action taxonomy.
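
One way to make such a test concrete is a pairwise ordering accuracy over the four level names the paper uses (Optimal, Suboptimal, Semantic Distractor, Unrelated Error). The metric below is a hypothetical sketch, not BBBench's official protocol, and the scores are invented for illustration. Ties between actions from different levels count as errors, which is an assumption of this sketch.

```python
from itertools import combinations

# Higher rank means a better action. The ordering is an assumption
# based on the level names alone.
LEVEL_RANK = {
    "Optimal": 3,
    "Suboptimal": 2,
    "Semantic Distractor": 1,
    "Unrelated Error": 0,
}


def pairwise_level_accuracy(items):
    """Fraction of cross-level pairs that the critic orders correctly.

    items: list of (level_name, critic_score) tuples.
    A tie in score between two different levels counts as incorrect.
    """
    correct = 0
    total = 0
    for (level_a, score_a), (level_b, score_b) in combinations(items, 2):
        rank_gap = LEVEL_RANK[level_a] - LEVEL_RANK[level_b]
        if rank_gap == 0:
            continue  # same level: no ordering to check
        total += 1
        if rank_gap * (score_a - score_b) > 0:
            correct += 1
    return correct / total if total else 0.0


# Invented scores: a binary critic versus a graded critic.
binary_items = [
    ("Optimal", 1.0),
    ("Suboptimal", 1.0),
    ("Semantic Distractor", 0.0),
    ("Unrelated Error", 0.0),
]
graded_items = [
    ("Optimal", 0.90),
    ("Suboptimal", 0.60),
    ("Semantic Distractor", 0.30),
    ("Unrelated Error", 0.05),
]

binary_acc = pairwise_level_accuracy(binary_items)
graded_acc = pairwise_level_accuracy(graded_items)
```

In this toy case the binary critic cannot separate Optimal from Suboptimal or Semantic Distractor from Unrelated Error, so it gets four of six pairs right, while the graded critic orders all six.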

Original abstract

Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
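
The test-time scaling loop the abstract describes can be sketched as a best-of-N selection over sampled candidates. The action names and scores below are hypothetical, and `best_of_n` is an illustrative helper rather than anything taken from the paper.

```python
def best_of_n(candidates, critic_score):
    """Rank sampled candidate actions by critic score, best first.

    critic_score is any callable mapping a candidate to a number.
    Python's sort is stable, so tied candidates keep their sampling order.
    """
    return sorted(candidates, key=critic_score, reverse=True)


# Three sampled actions for one instruction (invented example).
candidates = ["tap_draft", "tap_send", "open_settings"]

# A binary critic labels both plausible actions 1, so the tie is broken
# only by sampling order.
binary_scores = {"tap_draft": 1, "tap_send": 1, "open_settings": 0}

# A continuous critic can still prefer one plausible action over the other.
graded_scores = {"tap_draft": 0.61, "tap_send": 0.93, "open_settings": 0.05}

binary_pick = best_of_n(candidates, binary_scores.get)[0]
graded_pick = best_of_n(candidates, graded_scores.get)[0]
```

With the binary scores the selection falls back to whichever plausible action was sampled first. With graded scores the ranking is determined by the critic itself.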

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Binary GUI critics collapse hierarchies into yes/no labels, and this paper tries to fix it with two-stage contrastive learning plus a new four-level benchmark.

The main point is that existing binary critic models for GUI agents mix up scores for valid actions and plausible distractors because they squash hierarchical affordances into 0/1 labels. The paper proposes BBCritic to treat this as metric learning instead, aligning instructions and actions in a shared space via two-stage contrastive learning under the Functional Equivalence Hypothesis. They also release BBBench, which pairs dense actions with a four-level taxonomy for finer ranking tests. Their 3B model reportedly beats 7B binary SOTA models without extra annotations and transfers across platforms.

That smaller-model win and the benchmark are the concrete additions worth noting. The motivational analysis of entanglement and noise sensitivity is straightforward and matches what people see in practice with current critics. The reframing itself is clean and avoids overclaiming prior work.

The soft spots sit in the methods details that the abstract leaves out. How exactly the positive and negative pairs get sampled in each stage matters a lot for whether the hierarchy actually reappears or whether the model just learns better margins on existing validity signals. If pair construction quietly leans on taxonomy-derived cues, the no-extra-annotation claim weakens. The stress-test note flags this correctly, and without error breakdowns or explicit pair-construction diagrams it is hard to judge how much of the reported gain traces to the new framing versus standard contrastive tricks.

This paper is for people building test-time scaling loops for GUI agents or anyone evaluating critic models for action ranking. Readers who care about moving beyond classification losses in agent work will find the benchmark and the hypothesis useful to test. It has enough of a distinct idea and some empirical signal to deserve a serious referee rather than a desk reject, mainly so the pair-sampling and hierarchy-recovery claims can be checked against the full methods and ablations.

Referee Report

3 major / 2 minor

Summary. The paper argues that GUI critique for test-time scaling in agents is better framed as a metric-learning problem than binary classification. It identifies Affordance Collapse and Noise Sensitivity as defects in existing binary critic models that entangle valid actions with plausible distractors. BBCritic is proposed as a two-stage contrastive learning approach grounded in the Functional Equivalence Hypothesis, which aligns instructions and actions in a shared Affordance Space to recover hierarchical structure. BBBench is introduced as a new benchmark with a dense action space and four-level taxonomy for fine-grained ranking evaluation. Experiments claim that a 3B-parameter BBCritic model, trained without extra annotations, outperforms 7B-parameter state-of-the-art binary models and shows strong zero-shot transfer across platforms.

Significance. If the results and hierarchy-recovery claims hold, the work would provide a substantive reframing of critic modeling for GUI agents, potentially improving ranking quality in test-time scaling setups. The hierarchical benchmark BBBench would be a useful contribution for future fine-grained evaluation, and the no-extra-annotation training result would be notable if rigorously supported.

major comments (3)

[Abstract, §3] Abstract and §3 (Method): The central claim that two-stage contrastive learning recovers the four-level hierarchical affordance structure without extra annotations rests on the Functional Equivalence Hypothesis and specific pair-construction rules. The manuscript must explicitly describe how positives and negatives are sampled in each stage (e.g., whether pairs are derived solely from existing action traces or incorporate taxonomy-derived signals from BBBench). Without this, it is impossible to verify that the method avoids implicit supervision while still restoring the hierarchy that binary labels collapse.
[§4] §4 (Experiments): The reported outperformance of BBCritic-3B over 7B binary models is load-bearing for the paradigm-shift claim, yet the abstract and available description provide no data splits, ablation on the two contrastive stages, or error analysis separating ranking improvements from hierarchy recovery. These details are required to assess whether the gains support the metric-learning reframing or could arise from standard contrastive objectives alone.
[§2] §2 (Motivational Analysis): The entanglement between valid actions and plausible-but-invalid distractors is attributed to Affordance Collapse and Noise Sensitivity. The manuscript should provide quantitative evidence (e.g., score distributions or embedding visualizations) showing that binary models indeed compress the hierarchy, and that the proposed continuous alignment measurably restores it, rather than merely widening margins.

minor comments (2)

[§4] Ensure all figures in §4 clearly label the four-level taxonomy and show how BBCritic embeddings separate levels that binary models collapse.
[§3.2] Clarify the exact definition of 'dense action space' in BBBench and how it differs from prior GUI benchmarks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [Abstract, §3] Abstract and §3 (Method): The central claim that two-stage contrastive learning recovers the four-level hierarchical affordance structure without extra annotations rests on the Functional Equivalence Hypothesis and specific pair-construction rules. The manuscript must explicitly describe how positives and negatives are sampled in each stage (e.g., whether pairs are derived solely from existing action traces or incorporate taxonomy-derived signals from BBBench). Without this, it is impossible to verify that the method avoids implicit supervision while still restoring the hierarchy that binary labels collapse.

Authors: We agree that the pair-sampling procedure requires explicit detail to substantiate the no-extra-annotation claim. In the revised manuscript we have expanded §3.2 and §3.3 to specify that all positive pairs are formed from temporally adjacent or outcome-equivalent actions within the existing training traces, while negatives are obtained via in-batch random sampling and hard-negative mining based on embedding similarity; no taxonomy labels or signals from BBBench are used at any point during training. BBBench is employed exclusively for evaluation. This clarification confirms that hierarchy recovery arises from the contrastive objective and Functional Equivalence Hypothesis rather than implicit supervision.
revision: yes
Referee: [§4] §4 (Experiments): The reported outperformance of BBCritic-3B over 7B binary models is load-bearing for the paradigm-shift claim, yet the abstract and available description provide no data splits, ablation on the two contrastive stages, or error analysis separating ranking improvements from hierarchy recovery. These details are required to assess whether the gains support the metric-learning reframing or could arise from standard contrastive objectives alone.

Authors: The referee is correct that these experimental details are necessary for a rigorous assessment. We have added to the revised §4: (i) a clear description of the train/validation/test splits, (ii) ablations that isolate the contribution of each contrastive stage, and (iii) an error analysis that decomposes ranking gains into improvements attributable to hierarchy recovery versus general margin widening. These additions allow readers to evaluate whether the observed advantages are specific to the proposed reframing.
revision: yes
Referee: [§2] §2 (Motivational Analysis): The entanglement between valid actions and plausible-but-invalid distractors is attributed to Affordance Collapse and Noise Sensitivity. The manuscript should provide quantitative evidence (e.g., score distributions or embedding visualizations) showing that binary models indeed compress the hierarchy, and that the proposed continuous alignment measurably restores it, rather than merely widening margins.

Authors: We accept that the motivational analysis would be strengthened by quantitative support. The revised §2 now includes score-distribution histograms across the four hierarchy levels for representative binary models and t-SNE visualizations of the learned embeddings for both binary and BBCritic models. These figures demonstrate the compression of hierarchical distinctions under binary supervision and the measurable separation recovered by continuous alignment in the affordance space.
revision: yes
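
The negative-sampling recipe described in the first response (in-batch negatives plus hard-negative mining by embedding similarity) can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are taken to be L2-normalized so that the dot product equals cosine similarity, and the function name, the `k` parameter, and the toy batch are hypothetical rather than drawn from the manuscript.

```python
def dot(u, v):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(anchor, positive_idx, action_embs, k=2):
    """Return indices of the k non-positive actions most similar to the anchor.

    anchor:       instruction embedding (assumed L2-normalized)
    positive_idx: index of the action paired with this instruction
    action_embs:  all action embeddings in the batch (assumed L2-normalized)
    """
    scored = [
        (dot(anchor, emb), j)
        for j, emb in enumerate(action_embs)
        if j != positive_idx
    ]
    # Most similar first: these are the "hard" negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scored[:k]]


# Toy batch: index 0 is the positive for this instruction (invented data).
instruction = [1.0, 0.0]
batch_actions = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
    [0.8, 0.6],
    [-1.0, 0.0],
]
hard = mine_hard_negatives(instruction, positive_idx=0, action_embs=batch_actions)
```

A caveat with mining by similarity alone is that some mined negatives may be functionally equivalent to the positive, so a real pipeline would likely need a filter for such cases. Whether and how the authors handle this is not stated on this page.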

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper reframes GUI critique as metric learning via two-stage contrastive learning grounded in the Functional Equivalence Hypothesis, using standard contrastive objectives to align instructions and actions in a shared Affordance Space. This does not reduce by construction to fitted parameters, self-citations, or renamed inputs; the hypothesis serves as an explicit modeling assumption rather than a self-definitional loop, and the BBBench benchmark plus experimental comparisons provide independent external evaluation. No load-bearing steps collapse to the paper's own equations or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the Functional Equivalence Hypothesis as a domain assumption and the premise that contrastive learning recovers hierarchical structure without additional supervision.

axioms (1)

domain assumption: Functional Equivalence Hypothesis
Invoked to justify alignment in shared Affordance Space that recovers hierarchical structure flattened by binary labels.

pith-pipeline@v0.9.0 · 5800 in / 1172 out tokens · 49805 ms · 2026-05-19T16:41:22.234233+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens... Functional Equivalence Hypothesis"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "four-level semantic taxonomy (Optimal, Suboptimal, Semantic Distractor, Unrelated Error)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 23 internal anchors

[4]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2025
[10]

Advances in Neural Information Processing Systems , volume=

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration , author=. Advances in Neural Information Processing Systems , volume=

work page
[12]

arXiv e-prints , pages=

Self-Improving VLM Judges Without Human Annotations , author=. arXiv e-prints , pages=

work page
[13]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2024
[20]

interactions , volume=

Affordance, conventions, and design , author=. interactions , volume=. 1999 , publisher=

work page 1999
[22]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Os-genesis: Automating gui agent trajectory construction via reverse task synthesis , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[23]

arXiv preprint arXiv:2410.02907 , year=

Nnetnav: Unsupervised learning of browser agents through environment interaction in the wild , author=. arXiv preprint arXiv:2410.02907 , year=

work page arXiv
[27]

2026 , url=

Shaokang Wang and Pei Fu and Ruoceng Zhang and Shaojie Zhang and Xiuwen Xi and Jiahui Yang and Bin Qin and Ying Huang and Zhenbo Luo and Jian Luan , journal=. 2026 , url=

work page 2026
[31]

Advances in Neural Information Processing Systems , volume=

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , author=. Advances in Neural Information Processing Systems , volume=

work page
[33]

Journal educational computing research , volume=

User centered system design: new perspectives on human-computer interaction , author=. Journal educational computing research , volume=

work page
[36]

Authorea Preprints , year=

Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors , author=. Authorea Preprints , year=

work page
[38]

Advances in Neural Information Processing Systems , volume=

On the effects of data scale on ui control agents , author=. Advances in Neural Information Processing Systems , volume=

work page
[39]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

GUIOdyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

work page
[40]

Gemini 3 Flash and Gemini 3 Pro , year =

work page
[41]

Claude 4.0 Sonnet , year =

work page
[47]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Seeclick: Harnessing gui grounding for advanced visual gui agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[53]

Advances in Neural Information Processing Systems , volume=

Mind2web: Towards a generalist agent for the web , author=. Advances in Neural Information Processing Systems , volume=

work page
[55]

Proceedings of the 22nd International Conference on Machine Learning , pages=

Learning to rank using gradient descent , author=. Proceedings of the 22nd International Conference on Machine Learning , pages=

work page
[56]

Proceedings of the 24th International Conference on Machine Learning , pages=

Learning to rank: from pairwise approach to listwise approach , author=. Proceedings of the 24th International Conference on Machine Learning , pages=

work page
[57]

Proceedings of the 9th International Conference on Artificial Neural Networks , pages=

Support vector learning for ordinal regression , author=. Proceedings of the 9th International Conference on Artificial Neural Networks , pages=

work page
[58]

Proceedings of the 38th International Conference on Machine Learning , pages=

Learning Transferable Visual Models From Natural Language Supervision , author=. Proceedings of the 38th International Conference on Machine Learning , pages=

work page
[60]

Proceedings of the 38th International Conference on Machine Learning (ICML) , year=

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , author=. Proceedings of the 38th International Conference on Machine Learning (ICML) , year=

work page
[61]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

Sigmoid Loss for Language Image Pre-Training , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , year=

work page
[62]

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven , booktitle=

work page
[63]

Jiang, Ting and Song, Minghui and Zhang, Zihan and Huang, Haizhen and Deng, Weiwei and Sun, Feng and Zhang, Qi and Wang, Deqing and Zhuang, Fuzhen , journal=

work page
[64]

Lee, Chankyu and Roy, Rajarshi and Xu, Mengyao and Raiman, Jonathan and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei , journal=

work page
[66]

Lan, Zhibin and Niu, Liqiang and Meng, Fandong and Zhou, Jie and Su, Jinsong , booktitle=

work page
[67]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in

Yang, Jianwei and Zhang, Hao and Li, Feng and Zou, Xueyan and Li, Chunyuan and Gao, Jianfeng , journal=. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in

work page
[68]

Claude 4.0 sonnet

Anthropic . Claude 4.0 sonnet. Large Language Model, 2026. URL https://www.anthropic.com/claude. Accessed: 2026-01-29

work page 2026
[69]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

Burges, T

C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89--96, 2005

work page 2005
[71]

Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129--136, 2007

work page 2007
[72]

Y. Chai, H. Li, J. Zhang, L. Liu, G. Liu, G. Wang, S. Ren, S. Huang, and H. Li. A3: Android agent arena for mobile gui agents. arXiv preprint arXiv:2501.01149, 2025

work page arXiv 2025
[73]

C. Chen, K. Ji, H. Zhong, M. Zhu, A. Li, G. Gan, Z. Huang, C. Zou, J. Liu, J. Chen, et al. Gui-shepherd: Reliable process reward and verification for long-sequence gui tasks. arXiv preprint arXiv:2509.23738, 2025 a

work page arXiv 2025
[74]

L. Chen, R. Zheng, B. Wang, S. Jin, C. Huang, J. Ye, Z. Zhang, Y. Zhou, Z. Xi, T. Gui, et al. Improving discriminative capability of reward models in rlhf using contrastive learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15270--15283, 2024

work page 2024
[75]

L. Chen, H. Zhou, C. Cai, J. Zhang, P. Tong, Q. Kong, X. Zhang, C. Liu, Y. Liu, W. Wang, et al. Ui-ins: Enhancing gui grounding with multi-perspective instruction-as-reasoning. arXiv preprint arXiv:2510.20286, 2025 b

work page arXiv 2025
[76]

S. Chen, T. Zhao, Y. Bin, F. Ma, W. Shao, and Z. Wang. D-gara: A dynamic benchmarking framework for gui agent robustness in real-world anomalies. arXiv preprint arXiv:2511.16590, 2025 c

work page arXiv 2025
[77]

Cheng, Q

K. Cheng, Q. Sun, Y. Chu, F. Xu, L. YanTao, J. Zhang, and Z. Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313--9332, 2024

work page 2024
[78]

X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36: 0 28091--28114, 2023

work page 2023
[79]

ColPali: Efficient Document Retrieval with Vision Language Models

M. Faysse, H. Sibille, T. Wu, B. Omrani, G. Viaud, C. Hudelot, and P. Colombo. ColPali : Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[80]

Gemini 3 flash and gemini 3 pro

Google . Gemini 3 flash and gemini 3 pro. Large Language Model, 2026. URL https://gemini.google.com/. Accessed: 2026-01-29

work page 2026
[81]

Z. Gu, Z. Zeng, Z. Xu, X. Zhou, S. Shen, Y. Liu, B. Zhou, C. Meng, T. Xia, W. Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833, 2025

work page arXiv 2025
[82]

Z. Guan, J. C. L. Li, Z. Hou, P. Zhang, D. Xu, Y. Zhao, M. Wu, J. Chen, T.-T. Nguyen, P. Xian, et al. Kg-rag: Enhancing gui agent decision-making via knowledge graph-driven retrieval-augmented generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5396--5405, 2025

work page 2025
[83]

Herbrich, T

R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In Proceedings of the 9th International Conference on Artificial Neural Networks, pages 97--102, 1999

work page 1999
[84]

Y. Im, B. Jo, J. Wi, S. Baek, T. H. Min, J. H. Lee, S. Oh, I. Shin, and S. Lee. Modular and multi-path-aware offline benchmarking for mobile gui agents. arXiv preprint arXiv:2512.12634, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[85]

C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021

work page 2021
[86]

E5-V: Universal Embeddings with Multimodal Large Language Models

T. Jiang, M. Song, Z. Zhang, H. Huang, W. Deng, F. Sun, Q. Zhang, D. Wang, and F. Zhuang. E5-V : Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[87]

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Z. Jiang, R. Meng, X. Yang, S. Yavuz, Y. Zhou, and W. Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[88]

Z. Lan, L. Niu, F. Meng, J. Zhou, and J. Su. LLaVE : Large language and vision embedding models with hardness-weighted contrastive learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025

work page 2025
[89]

C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. NV-Embed : Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[90]

J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2 : Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

work page 2023
[91]

W. Li, W. E. Bishop, A. Li, C. Rawles, F. Campbell-Ajala, D. Tyamagundlu, and O. Riva. On the effects of data scale on ui control agents. Advances in Neural Information Processing Systems, 37: 0 92130--92154, 2024

work page 2024
[92]

M. Lin, P. Ding, S. Wang, Z. Zhuang, Y. Liu, X. Tong, W. Song, S. Lyu, S. Huang, and D. Wang. Hif-vla: Hindsight, insight and foresight through motion representation for vision-language-action models. arXiv preprint arXiv:2512.09928, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[93]

Y. Liu, P. Li, C. Xie, X. Hu, X. Han, S. Zhang, H. Yang, and F. Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[94]

Z. Liu, J. Xie, Z. Ding, Z. Li, B. Yang, Z. Wu, X. Wang, Q. Sun, S. Liu, W. Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data. arXiv preprint arXiv:2509.15221, 2025 b

work page arXiv 2025
[95]

Q. Lu, W. Shao, Z. Liu, L. Du, F. Meng, B. Li, B. Chen, S. Huang, K. Zhang, and P. Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404--22414, 2025 a

work page 2025
[96]

Z. Lu, Y. Chai, Y. Guo, X. Yin, L. Liu, H. Wang, H. Xiao, S. Ren, G. Xiong, and H. Li. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[97]

D. Luo, B. Tang, K. Li, G. Papoudakis, J. Song, S. Gong, J. Hao, J. Wang, and K. Shao. Vimo: A generative visual gui world model for app agents. arXiv preprint arXiv:2504.13936, 2025 a

work page arXiv 2025
[98]

R. Luo, L. Wang, W. He, L. Chen, J. Li, and X. Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[99]

LLM Critics Help Catch LLM Bugs,

N. McAleese, R. M. Pokorny, J. F. C. Uribe, E. Nitishinskaya, M. Trebacz, and J. Leike. Llm critics help catch llm bugs. arXiv preprint arXiv:2407.00215, 2024

work page arXiv 2024
[100]

D. A. Norman. Affordance, conventions, and design. interactions, 6 0 (3): 0 38--43, 1999

work page 1999
[101]

OpenAI . Gpt-5. Large Language Model, 2026. URL https://chatgpt.com/. Accessed: 2026-01-29

work page 2026
[102]

R. D. Pea. User centered system design: new perspectives on human-computer interaction. Journal educational computing research, 3: 0 129--134, 1987

work page 1987
[103]

Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[104]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748--8763, 2021

work page 2021
[105]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. Bishop, W. Li, F. Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[106]

B. Seed, J. Chen, T. Fan, X. Liu, L. Liu, Z. Lin, M. Wang, C. Wang, X. Wei, W. Xu, et al. Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914, 2025

[107]

Y. Shi, W. Yu, Z. Li, Y. Wang, H. Zhang, N. Liu, H. Mi, and D. Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025

[108]

X. Sun, Y. Chen, Y. Huang, R. Xie, J. Zhu, K. Zhang, S. Li, Z. Yang, J. Han, X. Shu, et al. Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent. arXiv preprint arXiv:2411.02265, 2024

[109]

G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[110]

H. Wang, H. Zou, H. Song, J. Feng, J. Fang, J. Lu, L. Liu, Q. Luo, S. Liang, S. Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

[111]

J. Wang, H. Xu, H. Jia, X. Zhang, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang. Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686--2710, 2024

[112]

S. Wang, P. Fu, R. Zhang, S. Zhang, X. Xi, J. Yang, B. Qin, Y. Huang, Z. Luo, and J. Luan. GAIA: A data flywheel system for training GUI test-time scaling critic models. arXiv preprint arXiv:2601.18197, 2026. URL https://arxiv.org/pdf/2601.18197

[113]

Y. Wanyan, X. Zhang, H. Xu, H. Liu, J. Wang, J. Ye, Y. Kou, M. Yan, F. Huang, X. Yang, et al. Look before you leap: A gui-critic-r1 model for pre-operative error diagnosis in gui automation. arXiv preprint arXiv:2506.04614, 2025

[114]

I. Wanyin Lin, Y. Hu, S. S. Li, S. Geng, P. W. Koh, L. Zettlemoyer, T. Althoff, and M. Ghazvininejad. Self-improving vlm judges without human annotations. arXiv e-prints, pages arXiv--2512, 2025

[115]

Q. Wu, K. Cheng, R. Yang, C. Zhang, J. Yang, H. Jiang, J. Mu, B. Peng, B. Qiao, R. Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents. arXiv preprint arXiv:2506.03143, 2025a

[116]

Z. Wu, Z. Wu, F. Xu, Y. Wang, Q. Sun, C. Jia, K. Cheng, Z. Ding, L. Chen, P. P. Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

[117]

Z. Wu, J. Xie, Z. Li, B. Yang, Q. Sun, Z. Liu, Z. Liu, Y. Qiao, X. Yue, Z. Wang, et al. Os-oracle: A comprehensive framework for cross-platform gui critic models. arXiv preprint arXiv:2512.16295, 2025b

[118]

H. Xiao, G. Wang, Y. Chai, Z. Lu, W. Lin, H. He, L. Fan, L. Bian, R. Hu, L. Liu, et al. Ui-genie: A self-improving approach for iteratively boosting mllm-based mobile gui agents. arXiv preprint arXiv:2505.21496, 2025

[119]

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040--52094, 2024

