Recognition: 2 theorem links · Lean Theorem
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
Pith reviewed 2026-05-15 01:54 UTC · model grok-4.3
The pith
GUI critique improves when reframed as continuous semantic alignment in a shared affordance space rather than binary classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space to recover the hierarchical structure that binary supervision flattens, allowing a 3B model to outperform 7B-parameter state-of-the-art binary models on GUI critique tasks with strong zero-shot transferability.
What carries the argument
Shared Affordance Space via two-stage contrastive learning that reframes GUI critique as metric learning to preserve action hierarchies.
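The alignment objective named here can be sketched concretely. Below is a minimal illustration of a temperature-scaled InfoNCE loss over instruction and action embeddings, using plain list vectors and cosine similarity; the function names and values are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(instr, actions, positive_idx, temperature=0.07):
    """Temperature-scaled InfoNCE: pull the instruction embedding
    toward its positive action, push it away from the other candidates."""
    logits = [cosine(instr, a) / temperature for a in actions]
    peak = max(logits)                      # stabilise the softmax
    exps = [math.exp(l - peak) for l in logits]
    return -math.log(exps[positive_idx] / sum(exps))

# A toy instruction embedding aligned with action 0 incurs lower loss
# when action 0 (rather than action 1) is treated as the positive.
instr = [1.0, 0.0]
actions = [[0.9, 0.1], [0.1, 0.9], [-0.5, 0.5]]
assert info_nce(instr, actions, 0) < info_nce(instr, actions, 1)
```

Because the loss is computed over similarities rather than 0/1 labels, near-misses score between the positive and clear distractors, which is the graded structure the binary objective discards.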
If this is right
- A smaller model can provide better ranking for test-time scaling in GUI agents.
- Zero-shot transfer across platforms and tasks is achievable without extra data.
- New benchmarks with hierarchical taxonomies are needed to properly evaluate critic models.
- GUI critique is more effectively treated as a metric-learning problem than a classification one.
Where Pith is reading between the lines
- This continuous approach may apply to other binary decision problems in AI agents to reduce model size requirements.
- It implies that agent performance can improve by focusing on semantic alignment rather than label accuracy alone.
- Testing the method on non-GUI environments could reveal broader applicability.
Load-bearing premise
A shared continuous Affordance Space can be learned to recover the hierarchical affordances compressed by binary labels.
What would settle it
Showing that the 3B BBCritic model fails to outperform binary baselines, or fails to rank actions according to the four-level taxonomy on the new benchmark, would disprove the central claim.
Original abstract
Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
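At inference time, the TTS loop the abstract describes (sample candidate actions, rank them via the critic) reduces to scoring each candidate against the instruction and executing the top-ranked one. A minimal sketch, assuming the critic exposes embeddings and uses cosine similarity as the score; the names here are illustrative, not the paper's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rank_candidates(instr_emb, candidate_embs):
    """Order sampled candidate actions by similarity to the instruction.
    Under TTS the agent executes the top-ranked candidate."""
    scored = sorted(enumerate(candidate_embs),
                    key=lambda pair: cosine(instr_emb, pair[1]),
                    reverse=True)
    return [idx for idx, _ in scored]

instr = [0.8, 0.6]
candidates = [[0.0, 1.0], [0.8, 0.6], [1.0, 0.0]]
assert rank_candidates(instr, candidates)[0] == 1  # exact match ranks first
```

The fine-grained ranking the abstract asks for is exactly this ordering: a binary critic can only partition candidates into valid/invalid, while a similarity score induces a total order.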
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that binary classification for GUI critic models leads to Affordance Collapse (flattening of hierarchical affordance structure into 0/1 labels) and Noise Sensitivity. It proposes BBCritic, a 3B-parameter model trained via two-stage contrastive learning to embed instructions and actions in a shared continuous Affordance Space, grounded in the Functional Equivalence Hypothesis. A new benchmark BBBench is introduced with a dense action space and four-level hierarchical taxonomy. Experiments report that BBCritic-3B, trained without extra annotations, outperforms 7B-parameter binary SOTA models and shows strong zero-shot transfer across platforms and tasks, reframing GUI critique as metric learning.
Significance. If the results and hypothesis hold, the work offers a substantive reframing of GUI critique that could enable more efficient test-time scaling for generalist agents using smaller models. The continuous alignment approach and BBBench benchmark address a clear limitation in current binary critics and provide a new evaluation resource for fine-grained ranking.
major comments (2)
- [Abstract / §3] Abstract and §3 (Functional Equivalence Hypothesis): The hypothesis that a shared continuous Affordance Space recovers the hierarchical structure lost by binary supervision is invoked to motivate the two-stage contrastive objective, yet no independent validation is supplied (e.g., no embedding-distance statistics, clustering purity by taxonomy level, or t-SNE analysis showing separation at the four levels of BBBench). Without such evidence, the outperformance of the 3B model over 7B binary baselines cannot be confidently attributed to structural recovery rather than training dynamics or data differences.
- [Experiments] Experimental results section: The central performance claim (BBCritic-3B outperforming 7B SOTA binary models with zero-shot transfer) is stated without ablations, training curves, error analysis, or controls for data volume and annotation quality. This makes it impossible to isolate the contribution of the continuous alignment objective from other factors.
minor comments (2)
- [Abstract] The abstract refers to 'dense action space' in BBBench but provides no quantitative details on action density or how the four-level taxonomy was constructed and validated.
- [Abstract] Notation for the contrastive loss and Affordance Space embedding is introduced without an explicit equation or diagram in the abstract, reducing immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key opportunities to strengthen the evidence for the Functional Equivalence Hypothesis and to better isolate the contributions of our continuous alignment approach. We will revise the manuscript to incorporate the requested analyses.
Point-by-point responses
Referee: [Abstract / §3] Abstract and §3 (Functional Equivalence Hypothesis): The hypothesis that a shared continuous Affordance Space recovers the hierarchical structure lost by binary supervision is invoked to motivate the two-stage contrastive objective, yet no independent validation is supplied (e.g., no embedding-distance statistics, clustering purity by taxonomy level, or t-SNE analysis showing separation at the four levels of BBBench). Without such evidence, the outperformance of the 3B model over 7B binary baselines cannot be confidently attributed to structural recovery rather than training dynamics or data differences.
Authors: We appreciate the referee's emphasis on direct validation. While the reported performance gains provide supporting evidence for the hypothesis, we agree that explicit analyses of the embedding space are needed to demonstrate recovery of the four-level hierarchy. In the revised manuscript, we will add t-SNE visualizations of the Affordance Space embeddings colored by taxonomy level, quantitative metrics including average intra-level and inter-level embedding distances, and clustering purity scores (e.g., normalized mutual information) computed on BBBench. These additions will help attribute improvements to structural recovery in the continuous space. revision: yes
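One of the promised quantitative checks, comparing intra-level with inter-level embedding distances, can be sketched as follows. `separation_gap` is a hypothetical helper run on toy vectors, not the authors' evaluation code.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def separation_gap(embeddings, levels):
    """Mean inter-level distance minus mean intra-level distance.
    A positive gap indicates that points sharing a taxonomy level sit
    closer together than points from different levels."""
    intra, inter = [], []
    for (e1, l1), (e2, l2) in combinations(list(zip(embeddings, levels)), 2):
        (intra if l1 == l2 else inter).append(euclidean(e1, e2))
    return sum(inter) / len(inter) - sum(intra) / len(intra)

# Two tight toy clusters, one per taxonomy level: the gap is positive.
embs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
lvls = [1, 1, 2, 2]
assert separation_gap(embs, lvls) > 0
```

A statistic like this, computed per level pair on BBBench, would give the referee's requested evidence that the hierarchy is actually recovered in the learned space.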
Referee: [Experiments] Experimental results section: The central performance claim (BBCritic-3B outperforming 7B SOTA binary models with zero-shot transfer) is stated without ablations, training curves, error analysis, or controls for data volume and annotation quality. This makes it impossible to isolate the contribution of the continuous alignment objective from other factors.
Authors: We acknowledge that the current experimental section lacks sufficient controls to isolate the effect of the two-stage contrastive objective. In the revision, we will add: (i) direct ablations training the same 3B backbone with a binary classification head versus our contrastive objective on identical data, (ii) training curves comparing convergence and final performance, (iii) error analysis stratified by taxonomy level and action type, and (iv) controls that vary data volume while holding annotation sources fixed. These will clarify the specific benefits of reframing GUI critique as metric learning. revision: yes
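The stratified error analysis promised in (iii) amounts to grouping critic outcomes by taxonomy level. A minimal sketch with an illustrative `accuracy_by_level` helper (not the authors' code):

```python
from collections import defaultdict

def accuracy_by_level(records):
    """records: (taxonomy_level, was_correct) pairs from a critic's
    rankings; returns per-level accuracy for stratified error analysis."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, correct in records:
        totals[level] += 1
        hits[level] += int(correct)
    return {level: hits[level] / totals[level] for level in totals}

toy = [(1, True), (1, True), (2, True), (2, False), (3, False)]
report = accuracy_by_level(toy)
assert report[1] == 1.0 and report[2] == 0.5 and report[3] == 0.0
```

Reporting this table for both the binary-head ablation and the contrastive model would show at which taxonomy levels the continuous objective helps most.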
Circularity Check
No significant circularity detected
Full rationale
The paper introduces the Functional Equivalence Hypothesis as an explicit grounding assumption to motivate reframing GUI critique as metric learning in a shared Affordance Space. The derivation proceeds from identified defects in binary classification (Affordance Collapse and Noise Sensitivity) to a two-stage contrastive objective, with performance claims supported by empirical results on the newly introduced BBBench benchmark. No equations, self-citations, or definitional reductions are present that make any prediction equivalent to its inputs by construction. The central claims rest on experimental outperformance rather than tautological restatement, rendering the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- (ad hoc to this paper) Functional Equivalence Hypothesis: instructions and actions can be aligned in a shared continuous Affordance Space that recovers hierarchical structure lost by binary supervision.
invented entities (3)
- Affordance Space (no independent evidence)
- BBCritic (no independent evidence)
- BBBench (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean : washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "reframes GUI critique from rigid binary discrimination to continuous semantic alignment via contrastive learning... project both modalities into a shared embedding space... temperature-scaled cosine similarity... InfoNCE loss"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean : embed_injective (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "Functional Equivalence Hypothesis... instruction I and optimal action a* are two forms of the same underlying GUI Affordance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [4] KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
- [10] Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems.
- [12] Self-Improving VLM Judges Without Human Annotations. arXiv e-prints.
- [13] Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
- [20] Affordance, conventions, and design. interactions, 1999.
- [22] OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [23] NNetNav: Unsupervised learning of browser agents through environment interaction in the wild. arXiv preprint arXiv:2410.02907.
- [27] Shaokang Wang, Pei Fu, Ruoceng Zhang, Shaojie Zhang, Xiuwen Xi, Jiahui Yang, Bin Qin, Ying Huang, Zhenbo Luo, and Jian Luan. 2026.
- [31] OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems.
- [33] User centered system design: new perspectives on human-computer interaction. Journal of Educational Computing Research.
- [36] Qwen 2.5: A comprehensive review of the leading resource-efficient LLM with potential to surpass all competitors. Authorea Preprints.
- [38] On the effects of data scale on UI control agents. Advances in Neural Information Processing Systems.
- [39] GUIOdyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices. Proceedings of the IEEE/CVF International Conference on Computer Vision.
- [40] Gemini 3 Flash and Gemini 3 Pro.
- [41] Claude 4.0 Sonnet.
- [47] SeeClick: Harnessing GUI grounding for advanced visual GUI agents. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [53] Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems.
- [55] Learning to rank using gradient descent. Proceedings of the 22nd International Conference on Machine Learning.
- [56] Learning to rank: from pairwise approach to listwise approach. Proceedings of the 24th International Conference on Machine Learning.
- [57] Support vector learning for ordinal regression. Proceedings of the 9th International Conference on Artificial Neural Networks.
- [58] Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning.
- [60] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).
- [61] Sigmoid Loss for Language Image Pre-Training. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- [62] Li, Junnan; Li, Dongxu; Savarese, Silvio; and Hoi, Steven.
- [63] Jiang, Ting; Song, Minghui; Zhang, Zihan; Huang, Haizhen; Deng, Weiwei; Sun, Feng; Zhang, Qi; Wang, Deqing; and Zhuang, Fuzhen.
- [64] Lee, Chankyu; Roy, Rajarshi; Xu, Mengyao; Raiman, Jonathan; Shoeybi, Mohammad; Catanzaro, Bryan; and Ping, Wei.
- [66] Lan, Zhibin; Niu, Liqiang; Meng, Fandong; Zhou, Jie; and Su, Jinsong.
- [67] Yang, Jianwei; Zhang, Hao; Li, Feng; Zou, Xueyan; Li, Chunyuan; and Gao, Jianfeng. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in
- [68] Anthropic. Claude 4.0 Sonnet. Large Language Model, 2026. URL https://www.anthropic.com/claude. Accessed: 2026-01-29.
- [69] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [71] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129--136, 2007.
- [74] L. Chen, R. Zheng, B. Wang, S. Jin, C. Huang, J. Ye, Z. Zhang, Y. Zhou, Z. Xi, T. Gui, et al. Improving discriminative capability of reward models in RLHF using contrastive learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15270--15283, 2024.
- [78] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091--28114, 2023.
- [79] M. Faysse, H. Sibille, T. Wu, B. Omrani, G. Viaud, C. Hudelot, and P. Colombo. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024.
- [80] Google. Gemini 3 Flash and Gemini 3 Pro. Large Language Model, 2026. URL https://gemini.google.com/. Accessed: 2026-01-29.
- [82] Z. Guan, J. C. L. Li, Z. Hou, P. Zhang, D. Xu, Y. Zhao, M. Wu, J. Chen, T.-T. Nguyen, P. Xian, et al. KG-RAG: Enhancing GUI agent decision-making via knowledge graph-driven retrieval-augmented generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5396--5405, 2025.
- [83] R. Herbrich, T. Graepel, and K. Obermayer. Support vector learning for ordinal regression. In Proceedings of the 9th International Conference on Artificial Neural Networks, pages 97--102, 1999.
- [84] Y. Im, B. Jo, J. Wi, S. Baek, T. H. Min, J. H. Lee, S. Oh, I. Shin, and S. Lee. Modular and multi-path-aware offline benchmarking for mobile GUI agents. arXiv preprint arXiv:2512.12634, 2025.
- [85] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- [86] T. Jiang, M. Song, Z. Zhang, H. Huang, W. Deng, F. Sun, Q. Zhang, D. Wang, and F. Zhuang. E5-V: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024.
- [88] Z. Lan, L. Niu, F. Meng, J. Zhou, and J. Su. LLaVE: Large language and vision embedding models with hardness-weighted contrastive learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, 2025.
- [89] C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024.
- [90] J. Li, D. Li, S. Savarese, and S. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
- [91] W. Li, W. E. Bishop, A. Li, C. Rawles, F. Campbell-Ajala, D. Tyamagundlu, and O. Riva. On the effects of data scale on UI control agents. Advances in Neural Information Processing Systems, 37:92130--92154, 2024.
- [92] M. Lin, P. Ding, S. Wang, Z. Zhuang, Y. Liu, X. Tong, W. Song, S. Lyu, S. Huang, and D. Wang. HiF-VLA: Hindsight, insight and foresight through motion representation for vision-language-action models. arXiv preprint arXiv:2512.09928, 2025.
- [95] Q. Lu, W. Shao, Z. Liu, L. Du, F. Meng, B. Li, B. Chen, S. Huang, K. Zhang, and P. Luo. GUIOdyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22404--22414, 2025.
- [98] R. Luo, L. Wang, W. He, L. Chen, J. Li, and X. Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents. arXiv preprint arXiv:2504.10458, 2025.
- [99] N. McAleese, R. M. Pokorny, J. F. C. Uribe, E. Nitishinskaya, M. Trebacz, and J. Leike. LLM critics help catch LLM bugs. arXiv preprint arXiv:2407.00215, 2024.
- [100] D. A. Norman. Affordance, conventions, and design. interactions, 6(3):38--43, 1999.
- [101] OpenAI. GPT-5. Large Language Model, 2026. URL https://chatgpt.com/. Accessed: 2026-01-29.
- [102] R. D. Pea. User centered system design: new perspectives on human-computer interaction. Journal of Educational Computing Research, 3:129--134, 1987.
- [103] Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.
- [104] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748--8763, 2021.
- [105] C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. Bishop, W. Li, F. Campbell-Ajala, et al. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024.
- [109] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [110] H. Wang, H. Zou, H. Song, J. Feng, J. Fang, J. Lu, L. Liu, Q. Luo, S. Liang, S. Huang, et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025.
- [111] J. Wang, H. Xu, H. Jia, X. Zhang, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang. Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686--2710, 2024.
- [114] I. Wanyin Lin, Y. Hu, S. S. Li, S. Geng, P. W. Koh, L. Zettlemoyer, T. Althoff, and M. Ghazvininejad. Self-improving VLM judges without human annotations. arXiv e-prints, 2025.
- [116] Z. Wu, Z. Wu, F. Xu, Y. Wang, Q. Sun, C. Jia, K. Cheng, Z. Ding, L. Chen, P. P. Liang, et al. OS-Atlas: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218, 2024.
- [119] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040--52094, 2024.