
arxiv: 2605.15963 · v1 · pith:45WR3ZSSnew · submitted 2026-05-15 · 💻 cs.AI

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Pith reviewed 2026-05-20 18:42 UTC · model grok-4.3

classification 💻 cs.AI
keywords: GUI agents, point-precise control, geometric GUI, vision-language models, reinforcement learning, semantic execution gap, topology-aware planning, PAGE Bench

The pith

PAGER narrows the semantic-execution gap in point-precise geometric GUI control, achieving 4.1 times higher task success than the strongest evaluated general baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses precision-sensitive GUI tasks, where actions must target exact points in continuous canvas space because geometric primitives depend on one another and a small coordinate error can cascade into downstream failures. It introduces PAGE Bench, with 4,906 problems and more than 224,000 pixel-level actions, to test these challenges. The proposed PAGER agent decomposes construction into dependency-structured planning and precise pixel-level execution. Supervised tuning establishes an executable action grammar, while reinforcement learning with state-conditioned geometric feedback corrects for exposure bias in rollouts. Step success rate rises from below 9 percent for GUI-specialized agents to over 62 percent.

Core claim

PAGER closes the Semantic-Execution Gap by combining dependency-structured planning with pixel-level execution. Pixel-grounded supervised tuning establishes the action grammar, and precision-aligned reinforcement learning uses state-conditioned geometric feedback to handle rollout errors. The reported result is 4.1x higher task success than the strongest evaluated general baseline and a step success rate above 62% for point-precise GUI control.

What carries the argument

The PAGER agent, which decomposes tasks via dependency-structured planning and applies precision-aligned reinforcement learning with state-conditioned geometric feedback to mitigate exposure bias.
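To make "dependency-structured planning" concrete, here is a minimal sketch, not the paper's planner. It assumes a construction can be written as a map from each object to the objects it depends on; the object names and the helper functions `plan_order` and `affected_by` are hypothetical. It orders the objects so prerequisites come first, then lists everything that a single misplaced object would distort.

```python
from graphlib import TopologicalSorter

# Hypothetical construction: each object maps to the objects it depends on.
dependencies = {
    "A": set(),
    "B": set(),
    "segment_AB": {"A", "B"},
    "midpoint_M": {"segment_AB"},
    "perpendicular_at_M": {"segment_AB", "midpoint_M"},
}

def plan_order(deps):
    """Return an order in which every object follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())

def affected_by(deps, bad):
    """Return every object that transitively depends on `bad`."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for obj, parents in deps.items():
            if obj not in affected and (bad in parents or parents & affected):
                affected.add(obj)
                changed = True
    return affected

order = plan_order(dependencies)
print(order)
print(sorted(affected_by(dependencies, "A")))
```

In this toy graph a misplaced point A invalidates the segment, its midpoint, and the perpendicular, which is the kind of cascade the summary describes.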

Load-bearing premise

That the 4,906 problems in PAGE Bench, together with the state-conditioned geometric feedback in the RL stage, are representative of real-world dependency-driven error propagation, and that the performance gains will generalize beyond these specific benchmark tasks.

What would settle it

Demonstrating that PAGER's success rates fall back to baseline levels when tested on geometric construction problems with different dependency structures or on live desktop interfaces not represented in PAGE Bench would challenge the central claims.

Figures

Figures reproduced from arXiv: 2605.15963 by Bihui Yu, Caijun Jia, Cheng Tan, Conghui He, Jingxuan Wei, Linzhuang Sun, Shan Liu, Siyuan Li, Xi Bai, Xinglong Xu, Zheng Sun.

Figure 1. Precision-sensitive GUI tasks expose a capability gap hidden by conventional GUI bench…
Figure 2. Overview of PAGER. Planning orders sub-tasks, execution grounds them into pixel-level actions, and training aligns supervision with precision rewards.
Adjacent text: … canvas state $C_0$ and generates
$\tau = (C_0, a_1, C_1, \ldots, a_L, C_L), \quad C_\ell = M(C_{\ell-1}, a_\ell), \quad a_\ell = (\kappa_\ell, o_\ell, \xi_\ell), \quad (1)$
where $M$ is the drawing environment, $\kappa_\ell \in \{\text{click}, \text{paint}, \text{type}\}$ is the operation type, $o_\ell$ is the object type, and $\xi_\ell$ denotes typed parameters. The tas…
Figure 3. Construction pipeline of PAGE Bench. Candidate geometry problems are screened for GeoGebra-executable instances, converted into structured task sequences, mapped to low-level GUI actions, executed with step-wise recording in a live environment, and finally filtered to retain high-quality trajectories for precision-sensitive geometric GUI learning and evaluation.
Figure 4. Question-type and skill composition in PAGE Bench.
Figure 5. Qualitative comparison. PAGER better preserves rectangular structure, diagonal intersection, and coordinate consistency.
Figure 6 illustrates a clear spatial separation in model performance. Most existing MLLMs, including GPT-5.4 and Gemini-3.1-Pro, cluster in the lower-left region with both low automated scores and low human ratings. In contrast, PAGER occupies the top-right corner, achieving high automated success alongside superior human preference. The near-perfect correlation (r = 0.9397) demonstrates a strong alignment betw…
Figure 7. Performance comparison of PAGER against fourteen baselines on PAGE Bench.
Figure 8. Fine-grained performance breakdown across ten geometric capabilities.
Figure 9. Prompt used to screen K12 mathematics questions for GeoGebra-based geometric visualiza…
Figure 10. Prompt used to generate structured GeoGebra construction tasks from K12 geometry…
Figure 11. Prompt used for multimodal quality assurance of GeoGebra-based dataset entries. The…
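The trajectory definition excerpted alongside Figure 2 (a canvas state updated by the environment M after each action, with each action carrying an operation type, an object type, and typed parameters) can be sketched as a small data structure. This is an illustrative reading only: the names `Action`, `rollout`, and `toy_env`, and the tuple-based canvas, are assumptions made here, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Canvas = Tuple[str, ...]  # toy canvas state: the object types drawn so far

@dataclass(frozen=True)
class Action:
    kind: str                 # operation type, one of {"click", "paint", "type"}
    obj: str                  # object type, e.g. "point" or "segment"
    params: Dict[str, float]  # typed parameters, e.g. pixel coordinates

def rollout(c0: Canvas, actions: List[Action],
            env: Callable[[Canvas, Action], Canvas]) -> list:
    """Apply C_l = M(C_{l-1}, a_l) and return [C_0, a_1, C_1, ..., a_L, C_L]."""
    trajectory: list = [c0]
    canvas = c0
    for action in actions:
        canvas = env(canvas, action)
        trajectory.extend([action, canvas])
    return trajectory

def toy_env(canvas: Canvas, action: Action) -> Canvas:
    """Stand-in for the drawing environment M: records the object type only."""
    return canvas + (action.obj,)

steps = [
    Action("click", "point", {"x": 10.0, "y": 20.0}),
    Action("click", "segment", {}),
]
trajectory = rollout((), steps, toy_env)
```

A real environment would render the canvas and return a screenshot; the toy version only shows how states and actions interleave in the trajectory.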
Original abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
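The contrast the abstract draws between region-tolerant and point-precise actions can be illustrated with a short sketch. The bounding box, the 2-pixel tolerance `tol_px`, and the coordinates are all assumptions chosen for illustration; the abstract does not state the benchmark's actual tolerance.

```python
import math

def hit_region(click, box):
    """Region-tolerant check: any pixel inside the component's box is valid."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def hit_point(click, target, tol_px=2.0):
    """Point-precise check: the click must land within tol_px of the target."""
    return math.dist(click, target) <= tol_px

def midpoint(p, q):
    """A dependent object: the midpoint inherits error from its endpoints."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

button = (100, 100, 180, 130)   # hypothetical button bounds (left, top, right, bottom)
target = (140.0, 115.0)         # hypothetical exact canvas point
click = (150.0, 120.0)          # 10 px right of and 5 px below the target
other_end = (200.0, 115.0)

print(hit_region(click, button))     # inside the box
print(hit_point(click, target))      # about 11.2 px away from the target
print(midpoint(target, other_end), midpoint(click, other_end))
```

The same click is valid for the button and invalid for the point, and the midpoint built on the misplaced endpoint is displaced as well, which is a small instance of dependency-driven error propagation.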

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a semantic-execution gap in precision-sensitive GUI tasks, where actions require point-level accuracy on continuous canvases and ontological dependencies can cause cascading failures. It introduces PAGE Bench (4,906 problems, >224K pixel-level actions) and proposes PAGER, which decomposes tasks into dependency-structured planning plus pixel-grounded execution. Supervised tuning learns an executable action grammar; precision-aligned RL uses state-conditioned geometric feedback to reduce exposure bias. Experiments show that general VLMs can exceed 88% action-type accuracy yet remain below 6% task success, while PAGER achieves 4.1x higher task success than the strongest evaluated general baseline and >62% step success, which the authors present as a new state of the art for point-precise GUI control.

Significance. If the empirical gains hold under standard visual-only inference, the work usefully isolates a new regime of dependency-driven error propagation in GUI agents and supplies both a benchmark and a training recipe that demonstrably narrows the gap between semantic understanding and executable precision. The scale of the action dataset and the explicit contrast between region-tolerant and point-precise regimes are concrete contributions that future GUI-agent research can build upon.

major comments (2)
  1. [Methods (precision-aligned RL)] Methods section describing precision-aligned RL: the state-conditioned geometric feedback (exact point-level errors and topological validity) is presented as the mechanism that mitigates exposure bias and yields the reported 62% step success. The manuscript does not state whether this oracle signal is removed at inference or whether an equivalent visual proxy is learned; if the 4.1x task-success gain depends on privileged geometric supervision unavailable to standard pixel-only agents, the central claim of practical SOTA does not yet transfer.
  2. [Experiments] Experiments section and abstract claims: the 4.1x task-success and >62% step-success figures are reported without error bars, confidence intervals, or explicit verification that the 224K actions were collected under the same train/test split used for evaluation. Because the central claim rests on these aggregate numbers establishing a new state of the art, the absence of statistical detail and reproducibility artifacts is load-bearing.
minor comments (2)
  1. [Abstract and §1] The term 'Semantic-Execution Gap' is used prominently in the abstract and title but receives only an informal gloss; a short formal definition or equation characterizing the gap (e.g., success rate conditioned on action-type accuracy) would improve clarity.
  2. [Figures and Tables] Figure captions and table headers should explicitly note whether reported metrics are macro- or micro-averaged and whether they include only successful trajectories or all rollouts.
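The second minor comment matters because macro- and micro-averaged scores can differ sharply when categories are unbalanced. A minimal sketch follows, with made-up category names and counts that are not taken from the paper:

```python
def micro_average(per_category):
    """Pool all steps: total successes divided by total attempts."""
    successes = sum(s for s, _ in per_category.values())
    attempts = sum(n for _, n in per_category.values())
    return successes / attempts

def macro_average(per_category):
    """Average per-category rates so every category counts equally."""
    rates = [s / n for s, n in per_category.values()]
    return sum(rates) / len(rates)

# Hypothetical (successes, attempts) per capability; placeholder numbers only.
counts = {"points": (90, 100), "circles": (1, 10)}

print(micro_average(counts))  # dominated by the large "points" category
print(macro_average(counts))  # weights both categories equally
```

Here the micro average is 91/110 while the macro average is 0.5, so a caption that omits the averaging mode leaves the reported number ambiguous.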

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important points on methodological clarity and statistical reporting that we will address to strengthen the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Methods (precision-aligned RL)] Methods section describing precision-aligned RL: the state-conditioned geometric feedback (exact point-level errors and topological validity) is presented as the mechanism that mitigates exposure bias and yields the reported 62% step success. The manuscript does not state whether this oracle signal is removed at inference or whether an equivalent visual proxy is learned; if the 4.1x task-success gain depends on privileged geometric supervision unavailable to standard pixel-only agents, the central claim of practical SOTA does not yet transfer.

    Authors: We appreciate the referee's request for explicit clarification on this point. The state-conditioned geometric feedback (point-level errors and topological validity) is used solely as a training-time signal within the precision-aligned RL stage to provide dense rewards and mitigate exposure bias during policy rollouts. Once training concludes, the resulting policy is deployed at inference using only standard visual observations from the GUI canvas, with no access to oracle geometric information. This is consistent with standard RL practice for GUI agents, where privileged signals aid learning but are unavailable during execution. We will revise the Methods section to state this distinction explicitly and confirm that all reported inference-time results (including the 4.1x task success and >62% step success) are obtained under visual-only conditions. revision: yes
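The separation described in this response, privileged geometric feedback during training and visual-only input at inference, can be sketched as follows. The reward shape (exponential decay in pixel error, zeroed when topology is invalid), the `scale_px` parameter, and both function names are assumptions for illustration; the manuscript's actual reward is not specified here.

```python
import math

def step_reward(pred, target, topology_ok, scale_px=5.0):
    """Training-time only: dense reward that decays with pixel error.

    Uses the ground-truth target and a topological validity flag, both of
    which are privileged signals unavailable at inference.
    """
    if not topology_ok:
        return 0.0
    return math.exp(-math.dist(pred, target) / scale_px)

def act(policy, screenshot):
    """Inference path: the policy receives the screenshot and nothing else."""
    return policy(screenshot)

# Hypothetical usage with a stand-in policy that always clicks one pixel.
dummy_policy = lambda screenshot: (3.0, 4.0)
prediction = act(dummy_policy, screenshot=None)
print(step_reward(prediction, (0.0, 0.0), topology_ok=True))
```

The point of the sketch is the signature difference: `step_reward` takes the target and validity flag, while `act` has no parameter through which such information could enter.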

  2. Referee: [Experiments] Experiments section and abstract claims: the 4.1x task-success and >62% step-success figures are reported without error bars, confidence intervals, or explicit verification that the 224K actions were collected under the same train/test split used for evaluation. Because the central claim rests on these aggregate numbers establishing a new state of the art, the absence of statistical detail and reproducibility artifacts is load-bearing.

    Authors: We agree that additional statistical detail and reproducibility information are necessary to support the central claims. The 4.1x task-success and >62% step-success results are computed on the held-out test split of PAGE Bench; the >224K pixel-level actions were collected exclusively from the training problems under the train/test split detailed in the dataset construction section. In the revised manuscript we will add error bars (standard deviation across five independent runs with different random seeds), 95% confidence intervals for the key metrics, and an explicit statement confirming the data split and evaluation protocol. We will also include a reproducibility appendix listing hyperparameters, seeds, and evaluation code references. revision: yes
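The statistics promised here (standard deviation across five seeds and a 95% confidence interval) could be computed along the following lines. This is a sketch under assumptions: it uses a Student t interval with the standard critical value for 4 degrees of freedom, and the input values are placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student t critical value for 4 degrees of freedom (five runs).
T_CRIT_95_DF4 = 2.776

def summarize(runs):
    """Return (mean, sample std, 95% CI half-width) for exactly five runs."""
    if len(runs) != 5:
        raise ValueError("the critical value above assumes exactly five runs")
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(runs))
    return mean, std, half_width

# Placeholder values chosen for easy checking; not measurements.
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std, half_width = summarize(runs)
print(f"{mean:.3f} +/- {half_width:.3f} (std {std:.3f})")
```

With only five seeds the t interval is noticeably wider than a normal-approximation interval, which is worth stating alongside the error bars.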

Circularity Check

0 steps flagged

No circularity: empirical results on new benchmark with independent method components

full rationale

The paper introduces PAGE Bench (4,906 problems) and PAGER agent, which applies standard pixel-grounded supervised tuning followed by RL using state-conditioned geometric feedback. Central claims consist of measured task success rates (4.1x improvement, 62% step success) on this benchmark rather than any derived quantity obtained by fitting parameters to a target and then re-predicting it, or by self-referential equations. No load-bearing step reduces to a self-citation chain, uniqueness theorem from the same authors, or ansatz smuggled via prior work; the feedback mechanism is an explicit training choice whose generalization is an empirical question separate from circularity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method relies on standard supervised and reinforcement learning techniques applied to a new domain.




Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 13 internal anchors

  1. [1]

    Claude 4.6 sonnet: Performance and safety updates

    Anthropic. Claude 4.6 sonnet: Performance and safety updates. Technical report, Anthropic PBC,

  2. [2]

    Available at: https://www.anthropic.com/news/ claude-sonnet-4-6

    Official announcement and technical overview. Available at: https://www.anthropic.com/news/ claude-sonnet-4-6

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward.arXiv preprint arXiv:2601.05073, 2026

    Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, and Renqiu Xia. Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward. arXiv preprint arXiv:2601.05073, 2026

  5. [5]

    Guicourse: From general vision language model to versatile gui agent

    Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language model to versatile gui agent. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21936–21959, 2025

  6. [6]

    Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision.arXiv preprint arXiv:2505.13427, 2025

    Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, and Wenqi Shao. Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision.arXiv preprint arXiv:2505.13427, 2025

  7. [7]

    Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark.arXiv preprint arXiv:2508.07575, 2025

    Shiqing Fan, Xichen Ding, Liang Zhang, and Linjian Mo. Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark.arXiv preprint arXiv:2508.07575, 2025

  8. [8]

    Geobench: Rethinking multimodal geometric problem-solving via hierarchical evaluation

    Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, and Junchi Yan. Geobench: Rethinking multimodal geometric problem-solving via hierarchical evaluation. InThe Fourteenth International Conference on Learning Representations, 2026. ICLR 2026 Poster

  9. [9]

    Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2026

    Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, Botian Shi, Bo Zhang, and Yu Qiao. Trustgeogen: Scalable and formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2025

  10. [10]

    Gemini 3.1 pro: Model card and technical overview

    Google DeepMind. Gemini 3.1 pro: Model card and technical overview. Technical report, Google, February 2026. URL https://deepmind.google/models/gemini/pro/. Official technical documentation describing the Gemini 3.1 Pro preview model. Available at:https://deepmind.google/models/model-cards/ gemini-3-1-pro/

  11. [11]

    Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

    Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020, 2025

  12. [12]

    Zhitao He, Zongwei Lyu, Dazhong Chen, Dadi Guo, and Yi R. Fung. Matp-bench: Can mllm be a good automated theorem prover for multimodal problems?arXiv preprint arXiv:2506.06034, 2025

  13. [13]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14281–14290, 2024

  14. [14]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  15. [15]

    Reguide: Data efficient gui grounding via spatial reasoning and search

    Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Cheonbok Park, Sookyo In, Chansong Jo, Jaehong Lee, Jinwoo Shin, and Kang Min Yoo. Reguide: Data efficient gui grounding via spatial reasoning and search. arXiv preprint arXiv:2505.15259, 2025

  16. [16]

    UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

    Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, and Hui Li. Ui-agile: Advancing gui agents with effective reinforcement learning and precise inference-time grounding.arXiv preprint arXiv:2507.22025, 2025. Accepted to CVPR 2026 Findings

  17. [17]

    Beyond clicking: A step towards generalist gui grounding via text dragging.arXiv preprint arXiv:2601.06031, 2026

    Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, and Ahmed Awadallah. Beyond clicking: A step towards generalist gui grounding via text dragging.arXiv preprint arXiv:2601.06031, 2026. 12 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

  18. [18]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/ 2024-01-30-llava-next/

  19. [19]

    InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025

  20. [20]

    Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection

    Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 103...

  21. [21]

    UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning.arXiv preprint arXiv:2503.21620, 2025

  22. [22]

    Visaidmath: Benchmark- ing visual-aided mathematical reasoning

    Jingkun Ma, Runzhe Zhan, Derek F. Wong, Yang Li, Di Sun, Hou Pong Chan, and Lidia S. Chao. Visaidmath: Benchmarking visual-aided mathematical reasoning.arXiv preprint arXiv:2410.22995, 2024

  23. [23]

    Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation

    Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9097–9110, 2024

  24. [24]

    Gui-360: A comprehensive dataset and benchmark for computer-using agents.arXiv preprint arXiv:2511.04307, 2025

    Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Gui-360: A comprehensive dataset and benchmark for computer-using agents.arXiv preprint arXiv:2511.04307, 2025

  25. [25]

    Gui agents: A survey

    Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538, 2025

  26. [26]

    Introducing gpt-5.4

    OpenAI. Introducing gpt-5.4. OpenAI Blog, 2026. Official announcement describing GPT-5.4 capabilities and model updates. Available at:https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/

  27. [28]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  28. [29]

    Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

    Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/ blog?id=qwen3.6-35b-a3b

  29. [30]

    Medspot: A workflow-aware sequential grounding benchmark for clinical gui.arXiv preprint arXiv:2603.19993, 2026

    Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, and Tajamul Ashraf. Medspot: A workflow-aware sequential grounding benchmark for clinical gui.arXiv preprint arXiv:2603.19993, 2026

  30. [31]

    Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720, 2025

    Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720, 2025

  31. [32]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

  32. [33]

    Solidgeo: Measuring multimodal spatial math reasoning in solid geometry

    Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, and Cheng-Lin Liu. Solidgeo: Measuring multimodal spatial math reasoning in solid geometry. InAdvances in Neural Information Processing Systems, 2025. NeurIPS 2025 Datasets and Benchmarks Track poster. 13 PAGER: Bridging the Semantic-Execution Gap in Point-Precise...

  33. [34]

    Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

  34. [35]

    History-aware reasoning for gui agents

    Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026

  35. [36]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  36. [37]

    Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions

    Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, and Cheng Tan. Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions. arXiv preprint arXiv:2508.03173, 2025

  37. [38]

    GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

    Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, and Han- meng Liu. Geosketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

  38. [39]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024

  39. [40]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

  40. [41]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Xiaobo Xia and Run Luo. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  41. [42]

    GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning.arXiv preprint arXiv:2504.12597, 2025

    Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, and Bo Zheng. Geosense: Evaluating identification and application of geometric principles in multimodal reasoning.arXiv preprint arXiv:2504.12597, 2025

  42. [43]

    Mobilerl: Online agentic reinforcement learning for mobile gui agents

    Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Jiayu Huang, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Online agentic reinforcement learning for mobile gui agents. InThe Fourteenth International Conference on Learning Representations, 2026. ICLR 2026 Poster

  43. [44]

    Probench: Benchmarking gui agents with accurate process information

    Leyang Yang, Ziwei Wang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. Probench: Benchmarking gui agents with accurate process information. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 27547–27555, 2026

  44. [45]

    Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

    Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, and Tong Zhang. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

  45. [46]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

  46. [47]

    Se-gui: Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning

    Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, and Bo Li. Se-gui: Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. InAdvances in Neural Information Processing Systems, 2025. NeurIPS 2025 poster

  47. [48]

    R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization

    Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1859–1869, October 2025

  48. [49]

    Geofm: Enhancing geometric reasoning of mllms via synthetic data generation through formal language

    Yuhao Zhang, Dingxin Hu, Tinghao Yu, Hao Liu, and Yiting Liu. Geofm: Enhancing geometric reasoning of mllms via synthetic data generation through formal language. arXiv preprint arXiv:2510.27448, 2025

  49. [50]

    Geochallenge: A multi-answer multiple-choice benchmark for geometric reasoning with diagrams

    Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, and Jun Liu. Geochallenge: A multi-answer multiple-choice benchmark for geometric reasoning with diagrams. arXiv preprint arXiv:2603.19252, 2026

  50. [51]

    Worldgui: An interactive benchmark for desktop gui automation from any starting point

    Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point. arXiv preprint arXiv:2502.08047, 2025

  51. [52]

    Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. In Advances in Neural Information Processing Systems,

  52. [53]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  53. [54]

    Exclude questions that do not require or cannot be visualized with GeoGebra, such as pure algebraic calculations or pure logical reasoning

    Core Task: Screen K12 mathematics questions to identify geometry-related ones that can use GeoGebra software for graphing to assist problem-solving/teaching. Exclude questions that do not require or cannot be visualized with GeoGebra, such as pure algebraic calculations or pure logical reasoning

  54. [55]

    Separate questions with blank lines or serial numbers

    Input Format: Batch text of K12 mathematics questions, which may include stems, options, and problem-solving steps. Separate questions with blank lines or serial numbers

  55. [56]

    Specify the school stage if available; otherwise, identify it automatically

    Input Scope: K12 mathematics questions, including primary, middle, and high school. Specify the school stage if available; otherwise, identify it automatically

  56. [57]

    Screening Criteria (Applicable if Any Is Met)

  57. [58]

    Core involves geometry graph drawing/analysis: triangles, quadrilaterals, circles, polygons, 3D shapes, coordinate system graphs, etc

  58. [59]

    Requires graph measurement/construction: side lengths, angles, areas, volumes, perpendicular/parallel lines, angle bisectors, circumscribed/inscribed circles, and transformations

  59. [60]

    Involves geometry relationship verification: congruence, similarity, parallelism, perpendicularity, collinearity, or concyclicity requiring visualization

  60. [61]

    Coordinate-system related: point coordinates, linear/circle equations, conic sections, and graph-based property analysis

  61. [62]

    Exclusion Criteria (Not Applicable if Any Is Met)

  62. [63]

    Pure algebraic calculations: solving equations, factorization, formula calculation, sequences, or non-geometric probability/statistics

  63. [64]

    Pure logical reasoning: text-only geometry theorem proofs or definition discrimination

  64. [65]

    No clear geometric elements: numerical calculations only or application questions with no graph correlation

  65. [66]

    Output Format (JSON Structure)

    Output Format (JSON Structure): Output a JSON array, where each element represents a screened question. Each element contains the following fields:
    • school_stage: String. School stage of the question: “Primary”, “Middle”, or “High”.
    • can_draw: Boolean. true means applicable to GeoGebra; false means not applicable.
    • ggb_content: String. If can_draw=true, provid...

  66. [67]

    JSON Output Example

    JSON Output Example: [ { "school_stage": "Middle", "can_draw": true, "ggb_content": "Right triangle ABC with right angle at C. Mark side AB=5cm. Label vertices A, B, C and right angle symbol. And then calculate BC.", "supplementary_note": "Use GeoGebra's Right Triangle tool for quick construction." }, { "school_stage": "Primary", "can_draw": false, "ggb_co...
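
A screening output in this format could be checked mechanically before it is passed on. The following is a minimal sketch, not the authors' tooling: the helper name `validate_screened` is hypothetical, the field names are taken from the example above, and treating `supplementary_note` as an optional string is an assumption because the field descriptions are truncated in the source.

```python
import json

# Stage names as listed in the output format description.
ALLOWED_STAGES = {"Primary", "Middle", "High"}


def validate_screened(raw: str) -> list[str]:
    """Return a list of problems found in a screening output (empty if none)."""
    data = json.loads(raw)
    if not isinstance(data, list):
        return ["top-level value must be a JSON array"]

    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"[{i}] element must be a JSON object")
            continue
        if item.get("school_stage") not in ALLOWED_STAGES:
            problems.append(f"[{i}] school_stage must be Primary, Middle, or High")
        if not isinstance(item.get("can_draw"), bool):
            problems.append(f"[{i}] can_draw must be a boolean")
        if not isinstance(item.get("ggb_content"), str):
            problems.append(f"[{i}] ggb_content must be a string")
        # Assumption: supplementary_note is optional, but must be a string if present.
        if not isinstance(item.get("supplementary_note", ""), str):
            problems.append(f"[{i}] supplementary_note must be a string")
    return problems
```

Calling `validate_screened` on a model response returns an empty list when every element has the expected shape, so the result can be used directly as a pass/fail filter.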

  67. [68]

    Your primary focus is to integrate all three components to infer accurate construction steps

    Mission Statement: You will receive a dataset entry containing three core components: Question, Answer, and Image. Your primary focus is to integrate all three components to infer accurate construction steps. You must:

  68. [69]

    Analyze: Analyze the construction logic by combining the problem requirements, answer clues, and image context

  69. [70]

    Infer: If coordinates are not explicitly provided, infer reasonable Cartesian coordinates that satisfy the described geometric relationships

  70. [71]

    Custom functions are not permitted

    Map: Map each construction step to the exact function in the Allowed Function Library. Custom functions are not permitted.
    4. Classify: Classify the problem by selecting skills, grade level, and drawing difficulty.
    5. Labeling rule: Do not add labels beyond what is required by the Question, Answer, or Image.
    6. Output: Strictly follow the Mandatory Output Fo...

  71. [72]

    Allowed Function Library

    Allowed Function Library: You may only use the functions defined below.
    A. General and Input
    • generate_input_action: Used for algebraic text input. Parameters: {"text": "string"}.
    • add_text_label: Add a text label at a specified position. Parameters: {"position": [x, y], "text": "string"}.
    B. Points
    • draw_point: Draw one or more points. Parameters:{"points...

  72. [73]

    Object Grounding and Dependency Rules

  73. [74]

    No floating reference points: Any reference point used in a construction function must have been previously created or must be strictly located on a previously created object

  74. [75]

    Create before use: If a step requires construction based on an existing object, that object must have been created in an earlier task

  75. [76]

    4. Prefer reuse of existing key points: Reuse endpoints, vertices, or centers whenever possible

    Strict on-object point constraints: Coordinates of an on-object point must exactly satisfy the geometric equation of that object. 4. Prefer reuse of existing key points: Reuse endpoints, vertices, or centers whenever possible

  76. [77]

    Construction Function Requirements • perpendicular_line and parallel_line: the first point must lie on an existing line, segment, or ray; the second point must be an existing point

    Implicit object registry: Maintain an internal list of created objects and ensure all later steps reference them consistently.
    Construction Function Requirements
    • perpendicular_line and parallel_line: the first point must lie on an existing line, segment, or ray; the second point must be an existing point.
    • perpendicular_bisector: both points must be exis...
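
The create-before-use and on-object rules above lend themselves to a simple automated check. The sketch below is illustrative only: the step schema with `creates` and `uses` keys and both function names are hypothetical and are not part of the Allowed Function Library, and the floating-point tolerance is an assumption (the rule asks for exact satisfaction, which integer coordinates with `tol=0` would give).

```python
def check_create_before_use(steps):
    """Return (step_index, name) pairs for objects used before they are created.

    Each step is a dict with hypothetical keys:
      'creates': names of objects this step adds to the registry
      'uses':    names of objects this step depends on
    """
    registry = set()  # the implicit object registry
    violations = []
    for idx, step in enumerate(steps):
        for name in step.get("uses", []):
            if name not in registry:
                violations.append((idx, name))
        registry.update(step.get("creates", []))
    return violations


def on_line(p, a, b, tol=1e-9):
    """True if point p lies on the infinite line through points a and b."""
    # The 2D cross product of (b - a) and (p - a) is zero for collinear points.
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return abs(cross) <= tol


steps = [
    {"creates": ["A", "B"]},
    {"creates": ["AB"], "uses": ["A", "B"]},
    {"creates": ["l"], "uses": ["AB", "C"]},  # C was never created
]
violations = check_create_before_use(steps)
```

Here `violations` flags point `C` in the third step as a floating reference point, and `on_line` can be used to confirm that a reference point passed to a construction function really lies on the existing line.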

  77. [78]

    • Integer Priority Rule: If special coordinates are not required, choose integer coordinates whenever possible

    Coordinate Inference Protocol
    • If coordinates are not provided, infer reasonable Cartesian coordinates that preserve geometric properties.
    • Integer Priority Rule: If special coordinates are not required, choose integer coordinates whenever possible.
    • Use decimals only when integers cannot satisfy the required geometric constraints.
    • If the figure impl...

  78. [79]

    Skill Taxonomy: Multiple selections are allowed

    Classification Standards: A. Skill Taxonomy. Multiple selections are allowed

  79. [80]

    Basic geometric object construction

  80. [81]

    Numerical and metric constraints

Showing first 80 references.