PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
Pith reviewed 2026-05-20 18:42 UTC · model grok-4.3
The pith
PAGER bridges the semantic-execution gap in point-precise geometric GUI control, achieving 4.1x higher task success than the strongest evaluated general baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PAGER closes the Semantic-Execution Gap by combining dependency-structured planning with pixel-level execution. Pixel-grounded supervised tuning establishes the action grammar, and precision-aligned reinforcement learning uses state-conditioned geometric feedback to counter rollout-induced exposure bias. The reported result is 4.1x higher task success than the strongest evaluated general baseline and a step success rate above 62% for point-precise GUI control.
What carries the argument
The PAGER agent, which decomposes tasks via dependency-structured planning and applies precision-aligned reinforcement learning with state-conditioned geometric feedback to mitigate exposure bias.
Load-bearing premise
That the 4,906 problems in PAGE Bench, together with the state-conditioned geometric feedback in the RL stage, are representative of real-world dependency-driven error propagation, and that the performance gains will generalize beyond these specific benchmark tasks.
What would settle it
Demonstrating that PAGER's success rates fall back to baseline levels when tested on geometric construction problems with different dependency structures or on live desktop interfaces not represented in PAGE Bench would challenge the central claims.
Original abstract
Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
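The abstract's contrast between region-tolerant and point-precise actions can be made concrete with a small sketch. The function names, the bounding-box representation, and all numbers below are illustrative assumptions rather than the paper's evaluation code; the point-level check uses a thresholded Euclidean distance, which is the form of the point-success indicator quoted from the paper elsewhere on this page.

```python
import math

def region_hit(click, box):
    """Region-tolerant check: any pixel inside the component's box counts."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def point_hit(click, target, eps):
    """Point-precise check: Euclidean distance to the target must be <= eps."""
    return math.dist(click, target) <= eps

# Hypothetical numbers for illustration only.
button_box = (100, 100, 180, 130)  # a toolbar button, 80 x 30 px
vertex = (140.0, 115.0)            # an intended canvas point
click = (150.0, 120.0)             # about 11.2 px away from the vertex

print(region_hit(click, button_box))      # True
print(point_hit(click, vertex, eps=3.0))  # False
```

The same click is accepted when the target is a button-sized region and rejected when the target is a single canvas point with a 3 px tolerance, which is the regime the paper calls precision-sensitive.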
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a semantic-execution gap in precision-sensitive GUI tasks where actions require point-level accuracy on continuous canvases and ontological dependencies can cause cascading failures. It introduces PAGE Bench (4,906 problems, >224K pixel-level actions) and proposes PAGER, which decomposes tasks into dependency-structured planning plus pixel-grounded execution. Supervised tuning learns executable action grammar; precision-aligned RL uses state-conditioned geometric feedback to reduce exposure bias. Experiments show general VLMs exceed 88% action-type accuracy yet <6% task success, while PAGER achieves 4.1x higher task success than the strongest baseline and >62% step success, establishing a new SOTA for point-precise GUI control.
Significance. If the empirical gains hold under standard visual-only inference, the work usefully isolates a new regime of dependency-driven error propagation in GUI agents and supplies both a benchmark and a training recipe that demonstrably narrows the gap between semantic understanding and executable precision. The scale of the action dataset and the explicit contrast between region-tolerant and point-precise regimes are concrete contributions that future GUI-agent research can build upon.
major comments (2)
- [Methods (precision-aligned RL)] Methods section describing precision-aligned RL: the state-conditioned geometric feedback (exact point-level errors and topological validity) is presented as the mechanism that mitigates exposure bias and yields the reported 62% step success. The manuscript does not state whether this oracle signal is removed at inference or whether an equivalent visual proxy is learned; if the 4.1x task-success gain depends on privileged geometric supervision unavailable to standard pixel-only agents, the central claim of practical SOTA does not yet transfer.
- [Experiments] Experiments section and abstract claims: the 4.1x task-success and >62% step-success figures are reported without error bars, confidence intervals, or explicit verification that the 224K actions were collected under the same train/test split used for evaluation. Because the central claim rests on these aggregate numbers establishing a new state of the art, the absence of statistical detail and reproducibility artifacts is load-bearing.
minor comments (2)
- [Abstract and §1] The term 'Semantic-Execution Gap' is used prominently in the abstract and title but receives only an informal gloss; a short formal definition or equation characterizing the gap (e.g., success rate conditioned on action-type accuracy) would improve clarity.
- [Figures and Tables] Figure captions and table headers should explicitly note whether reported metrics are macro- or micro-averaged and whether they include only successful trajectories or all rollouts.
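One way to make the first minor comment concrete is sketched below. This is an illustrative formalization only, not a definition taken from the paper; the symbols $\mathrm{Acc}_{\mathrm{type}}$ and $\mathrm{TSR}$ and the name $G_{\mathrm{SE}}$ are assumptions introduced here.

```latex
% Acc_type: fraction of steps whose predicted action type is correct (assumed name)
% TSR: fraction of tasks whose final construction is judged correct (assumed name)
G_{\mathrm{SE}} \;=\; \mathrm{Acc}_{\mathrm{type}} \;-\; \mathrm{TSR}

% Conditional variant, following the referee's wording:
\Pr\bigl[\text{step succeeds} \,\big|\, \text{action type correct}\bigr]
```

Under the first reading, a model sitting at the bounds quoted in the abstract (action-type accuracy above 88%, task success below 6%) would have $G_{\mathrm{SE}} > 0.82$.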
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important points on methodological clarity and statistical reporting that we will address to strengthen the manuscript. We respond to each major comment below.
Referee: [Methods (precision-aligned RL)] Methods section describing precision-aligned RL: the state-conditioned geometric feedback (exact point-level errors and topological validity) is presented as the mechanism that mitigates exposure bias and yields the reported 62% step success. The manuscript does not state whether this oracle signal is removed at inference or whether an equivalent visual proxy is learned; if the 4.1x task-success gain depends on privileged geometric supervision unavailable to standard pixel-only agents, the central claim of practical SOTA does not yet transfer.
Authors: We appreciate the referee's request for explicit clarification on this point. The state-conditioned geometric feedback (point-level errors and topological validity) is used solely as a training-time signal within the precision-aligned RL stage to provide dense rewards and mitigate exposure bias during policy rollouts. Once training concludes, the resulting policy is deployed at inference using only standard visual observations from the GUI canvas, with no access to oracle geometric information. This is consistent with standard RL practice for GUI agents, where privileged signals aid learning but are unavailable during execution. We will revise the Methods section to state this distinction explicitly and confirm that all reported inference-time results (including the 4.1x task success and >62% step success) are obtained under visual-only conditions. revision: yes
Referee: [Experiments] Experiments section and abstract claims: the 4.1x task-success and >62% step-success figures are reported without error bars, confidence intervals, or explicit verification that the 224K actions were collected under the same train/test split used for evaluation. Because the central claim rests on these aggregate numbers establishing a new state of the art, the absence of statistical detail and reproducibility artifacts is load-bearing.
Authors: We agree that additional statistical detail and reproducibility information are necessary to support the central claims. The 4.1x task-success and >62% step-success results are computed on the held-out test split of PAGE Bench; the >224K pixel-level actions were collected exclusively from the training problems under the train/test split detailed in the dataset construction section. In the revised manuscript we will add error bars (standard deviation across five independent runs with different random seeds), 95% confidence intervals for the key metrics, and an explicit statement confirming the data split and evaluation protocol. We will also include a reproducibility appendix listing hyperparameters, seeds, and evaluation code references. revision: yes
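The statistical reporting promised in this response (standard deviation and 95% confidence intervals across five seeded runs) could be computed along the following lines. This is a minimal sketch under stated assumptions: it uses a Student's t interval with 4 degrees of freedom, the function name is hypothetical, and the input values are placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776

def summarize_runs(values):
    """Return mean, sample standard deviation, and 95% CI half-width for five runs."""
    if len(values) != 5:
        raise ValueError("T_CRIT_95_DF4 is only valid for exactly five runs")
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (n - 1 in the denominator)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(values))
    return mean, std, half_width

# Placeholder success rates, NOT results from the paper.
placeholder_rates = [0.50, 0.52, 0.51, 0.53, 0.54]
mean, std, hw = summarize_runs(placeholder_rates)
print(f"{mean:.3f} +/- {std:.3f} (95% CI: {mean - hw:.3f} to {mean + hw:.3f})")
```

With only five runs the t-based interval is noticeably wider than a normal-approximation interval, so stating which one is used matters for comparing against baselines.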
Circularity Check
No circularity: empirical results on new benchmark with independent method components
The paper introduces PAGE Bench (4,906 problems) and PAGER agent, which applies standard pixel-grounded supervised tuning followed by RL using state-conditioned geometric feedback. Central claims consist of measured task success rates (4.1x improvement, 62% step success) on this benchmark rather than any derived quantity obtained by fitting parameters to a target and then re-predicting it, or by self-referential equations. No load-bearing step reduces to a self-citation chain, uniqueness theorem from the same authors, or ansatz smuggled via prior work; the feedback mechanism is an explicit training choice whose generalization is an empirical question separate from circularity. The derivation chain is therefore self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: PAGER factorizes drawing into dependency-structured planning and pixel-level execution... precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $\mathrm{Succ}_{\mathrm{pt}}(a) = \mathbb{I}\left[\lVert p(a) - p^{*} \rVert_2 \le \epsilon\right]$ ... $\Delta C_{\ell+1} \approx J_{\ell}\,\Delta C_{\ell} + B_{\ell}\,\Delta \xi_{\ell}$
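The linearized recursion in the second quoted passage suggests how a single coordinate slip can grow across dependent construction steps. The sketch below iterates that recursion in the scalar case. The interpretation of the symbols is an assumption made here (the page does not define them): ΔC is read as the deviation of the construction state, Δξ as the execution error of one action, and J and B as sensitivities. The function name and all numbers are hypothetical.

```python
def propagate(j_gains, b_gains, click_errors, delta_c0=0.0):
    """Iterate dC[l+1] = J[l] * dC[l] + B[l] * dxi[l] for scalar quantities."""
    deviations = [delta_c0]
    for j, b, dxi in zip(j_gains, b_gains, click_errors):
        deviations.append(j * deviations[-1] + b * dxi)
    return deviations

# Hypothetical case: one 2 px slip at the first step, exact clicks afterwards,
# and a constant sensitivity J = 1.5 between consecutive dependent objects.
steps = 4
devs = propagate([1.5] * steps, [1.0] * steps, [2.0, 0.0, 0.0, 0.0])
print(devs)  # [0.0, 2.0, 3.0, 4.5, 6.75]
```

With |J| above 1 the initial slip keeps growing even though every later action is exact, which matches the cascading-failure picture in the abstract; with |J| below 1 the same slip would decay.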
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Core Task: Screen K12 mathematics questions to identify geometry-related ones that can use GeoGebra software for graphing to assist problem-solving or teaching. Exclude questions that do not require or cannot be visualized with GeoGebra, such as pure algebraic calculations or pure logical reasoning.
Input Format: Batch text of K12 mathematics questions, which may include stems, options, and problem-solving steps. Separate questions with blank lines or serial numbers.
Input Scope: K12 mathematics questions, including primary, middle, and high school. Specify the school stage if available; otherwise, identify it automatically.
Screening Criteria (Applicable if Any Is Met)
- Core involves geometry graph drawing/analysis: triangles, quadrilaterals, circles, polygons, 3D shapes, coordinate system graphs, etc.
- Requires graph measurement/construction: side lengths, angles, areas, volumes, perpendicular/parallel lines, angle bisectors, circumscribed/inscribed circles, and transformations.
- Involves geometry relationship verification: congruence, similarity, parallelism, perpendicularity, collinearity, or concyclicity requiring visualization.
- Coordinate-system related: point coordinates, linear/circle equations, conic sections, and graph-based property analysis.
Exclusion Criteria (Not Applicable if Any Is Met)
- Pure algebraic calculations: solving equations, factorization, formula calculation, sequences, or non-geometric probability/statistics.
- Pure logical reasoning: text-only geometry theorem proofs or definition discrimination.
- No clear geometric elements: numerical calculations only or application questions with no graph correlation.
Output Format (JSON Structure): Output a JSON array, where each element represents a screened question. Each element contains the following fields:
- school_stage: String. School stage of the question: “Primary”, “Middle”, or “High”.
- can_draw: Boolean. true means applicable to GeoGebra; false means not applicable.
- ggb_content: String. If can_draw=true, provid...
JSON Output Example: [ { "school_stage": "Middle", "can_draw": true, "ggb_content": "Right triangle ABC with right angle at C. Mark side AB=5cm. Label vertices A, B, C and right angle symbol. And then calculate BC.", "supplementary_note": "Use GeoGebra's Right Triangle tool for quick construction." }, { "school_stage": "Primary", "can_draw": false, "ggb_co...
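A consumer of this screening output would typically validate it before use. The sketch below checks only the fields visible in the excerpt above. The function name is hypothetical, and the rule that `ggb_content` must be non-empty when `can_draw` is true is an assumption, since the field description is truncated in the source.

```python
import json

ALLOWED_STAGES = {"Primary", "Middle", "High"}

def validate_screening_output(raw_json):
    """Return a list of (index, message) pairs; an empty list means no issue found."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        return [(-1, "top-level value must be a JSON array")]
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append((index, "element must be a JSON object"))
            continue
        stage = item.get("school_stage")
        if not isinstance(stage, str) or stage not in ALLOWED_STAGES:
            problems.append((index, "school_stage must be Primary, Middle, or High"))
        can_draw = item.get("can_draw")
        if not isinstance(can_draw, bool):
            problems.append((index, "can_draw must be a boolean"))
        # Assumption: a drawable question needs a non-empty ggb_content string.
        content = item.get("ggb_content")
        if can_draw is True and not (isinstance(content, str) and content.strip()):
            problems.append((index, "ggb_content must be non-empty when can_draw is true"))
    return problems

good = json.dumps([{
    "school_stage": "Middle",
    "can_draw": True,
    "ggb_content": "Right triangle ABC with right angle at C.",
}])
bad = json.dumps([{"school_stage": "College", "can_draw": "yes"}])

print(validate_screening_output(good))  # []
print(validate_screening_output(bad))
```

Returning a list of problems instead of raising on the first one makes it easier to report every malformed element in a batch at once.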
Mission Statement: You will receive a dataset entry containing three core components: Question, Answer, and Image. Your primary focus is to integrate all three components to infer accurate construction steps. You must:
1. Analyze: Analyze the construction logic by combining the problem requirements, answer clues, and image context.
2. Infer: If coordinates are not explicitly provided, infer reasonable Cartesian coordinates that satisfy the described geometric relationships.
3. Map: Map each construction step to the exact function in the Allowed Function Library. Custom functions are not permitted.
4. Classify: Classify the problem by selecting skills, grade level, and drawing difficulty.
5. Labeling rule: Do not add labels beyond what is required by the Question, Answer, or Image.
6. Output: Strictly follow the Mandatory Output Fo...
Allowed Function Library: You may only use the functions defined below.
A. General and Input
- generate_input_action: Used for algebraic text input. Parameters: {"text": "string"}.
- add_text_label: Add a text label at a specified position. Parameters: {"position": [x, y], "text": "string"}.
B. Points
- draw_point: Draw one or more points. Parameters: {"points...
Object Grounding and Dependency Rules
1. No floating reference points: Any reference point used in a construction function must have been previously created or must be strictly located on a previously created object.
2. Create before use: If a step requires construction based on an existing object, that object must have been created in an earlier task.
3. Strict on-object point constraints: Coordinates of an on-object point must exactly satisfy the geometric equation of that object.
4. Prefer reuse of existing key points: Reuse endpoints, vertices, or centers whenever possible.
5. Implicit object registry: Maintain an internal list of created objects and ensure all later steps reference them consistently.
Construction Function Requirements
- perpendicular_line and parallel_line: the first point must lie on an existing line, segment, or ray; the second point must be an existing point.
- perpendicular_bisector: both points must be exis...
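The create-before-use and object-registry rules above amount to a single pass over the construction steps with a set of known object names. The sketch below illustrates that idea. The step schema with `function`, `creates`, and `uses` keys is a hypothetical simplification introduced here; the actual parameter formats of the function library are truncated in the source and are not reproduced.

```python
def check_dependencies(steps):
    """Return (step_index, function_name, missing_object) for each rule violation."""
    registry = set()   # names of objects created so far
    violations = []
    for index, step in enumerate(steps):
        for name in step.get("uses", []):
            if name not in registry:
                violations.append((index, step["function"], name))
        # Objects become available only after the step that creates them.
        registry.update(step.get("creates", []))
    return violations

# Hypothetical construction plan; object names are illustrative.
plan = [
    {"function": "draw_point", "creates": ["A", "B"], "uses": []},
    {"function": "perpendicular_bisector", "creates": ["m"], "uses": ["A", "B"]},
    {"function": "perpendicular_line", "creates": ["n"], "uses": ["l", "A"]},
]

print(check_dependencies(plan))  # [(2, 'perpendicular_line', 'l')]
```

The third step refers to a line `l` that no earlier step created, so it is flagged, while the bisector step passes because both of its points already exist.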
Coordinate Inference Protocol
- If coordinates are not provided, infer reasonable Cartesian coordinates that preserve geometric properties.
- Integer Priority Rule: If special coordinates are not required, choose integer coordinates whenever possible.
- Use decimals only when integers cannot satisfy the required geometric constraints.
- If the figure impl...
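One plausible reason for preferring integer coordinates is that the strict on-object constraint stated earlier (a point must exactly satisfy the object's equation) can then be checked with exact arithmetic. The sketch below illustrates this with integers and rationals. It is an interpretation offered here, not a procedure described in the source, and the helper names are hypothetical.

```python
from fractions import Fraction

def on_line(p, a, b):
    """True if p lies exactly on the line through a and b (zero cross product)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax) == 0

def on_circle(p, center, radius_squared):
    """True if p lies exactly on the circle with the given squared radius."""
    (px, py), (cx, cy) = p, center
    return (px - cx) ** 2 + (py - cy) ** 2 == radius_squared

# Integer coordinates: the checks are exact.
print(on_line((2, 2), (0, 0), (4, 4)))   # True
print(on_circle((3, 4), (0, 0), 25))     # True

# Rational coordinates stay exact when expressed as Fractions.
p = (Fraction(7, 5), Fraction(24, 5))
print(on_circle(p, (0, 0), 25))          # True
```

Binary floating-point decimals such as 1.4 and 4.8 may not satisfy the same equality exactly, which is why a checker of this kind would either keep coordinates rational or fall back to a tolerance.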
Classification Standards
A. Skill Taxonomy (multiple selections are allowed)
- Basic geometric object construction
- Numerical and metric constraints