PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

Bihui Yu; Caijun Jia; Cheng Tan; Conghui He; Jingxuan Wei; Linzhuang Sun; Shan Liu; Siyuan Li; Xi Bai; Xinglong Xu

arxiv: 2605.15963 · v1 · pith:45WR3ZSSnew · submitted 2026-05-15 · 💻 cs.AI


Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan


Pith reviewed 2026-05-20 18:42 UTC · model grok-4.3

classification 💻 cs.AI

keywords: GUI agents · point-precise control · geometric GUI · vision-language models · reinforcement learning · semantic execution gap · topology-aware planning · PAGE Bench


The pith

PAGER narrows the semantic-execution gap in point-precise geometric GUI control, reporting 4.1 times higher task success than the strongest evaluated general baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses precision-sensitive GUI tasks where actions must target exact points in continuous space because geometric primitives have dependencies that turn small coordinate errors into cascading failures. It creates PAGE Bench with 4,906 problems and more than 224,000 pixel-level actions to test these challenges. The proposed PAGER agent decomposes construction into dependency-structured planning and precise pixel execution. Supervised tuning builds an executable action grammar while reinforcement learning with state-conditioned geometric feedback corrects for exposure bias in rollouts. This yields major gains, with step success rising above 62 percent from under 9 percent in prior GUI agents.

Core claim

PAGER closes the Semantic-Execution Gap by combining dependency-structured planning with pixel-level execution. Pixel-grounded supervised tuning sets up the action grammar, and precision-aligned reinforcement learning uses state-conditioned geometric feedback to handle rollout errors, resulting in 4.1x higher task success and over 62% step success rate for point-precise GUI control.

What carries the argument

The PAGER agent, which decomposes tasks via dependency-structured planning and applies precision-aligned reinforcement learning with state-conditioned geometric feedback to mitigate exposure bias.

Load-bearing premise

That the 4,906 problems in PAGE Bench, together with the state-conditioned geometric feedback in the RL stage, are representative of real-world dependency-driven error propagation, and that the performance gains will generalize beyond these specific benchmark tasks.

What would settle it

Demonstrating that PAGER's success rates fall back to baseline levels when tested on geometric construction problems with different dependency structures or on live desktop interfaces not represented in PAGE Bench would challenge the central claims.

Figures

Figures reproduced from arXiv: 2605.15963 by Bihui Yu, Caijun Jia, Cheng Tan, Conghui He, Jingxuan Wei, Linzhuang Sun, Shan Liu, Siyuan Li, Xi Bai, Xinglong Xu, Zheng Sun.

**Figure 2.** Overview of PAGER. Planning orders sub-tasks, execution grounds them into pixel-level actions, and training aligns supervision with precision rewards. Extracted body text: … canvas state $C_0$ and generates $\tau = (C_0, a_1, C_1, \ldots, a_L, C_L)$, $C_\ell = M(C_{\ell-1}, a_\ell)$, $a_\ell = (\kappa_\ell, o_\ell, \xi_\ell)$ (1), where $M$ is the drawing environment, $\kappa_\ell \in \{\text{click}, \text{paint}, \text{type}\}$ is the operation type, $o_\ell$ is the object type, and $\xi_\ell$ denotes typed parameters. The tas…

**Figure 3.** Construction pipeline of PAGE Bench. Candidate geometry problems are screened for GeoGebra-executable instances, converted into structured task sequences, mapped to low-level GUI actions, executed with step-wise recording in a live environment, and finally filtered to retain high-quality trajectories for precision-sensitive geometric GUI learning and evaluation.

**Figure 4.** Question-type and skill composition in PAGE Bench.

**Figure 5.** Qualitative comparison. PAGER better preserves rectangular structure, diagonal intersection, and coordinate consistency.

**Figure 6.** Figure 6 illustrates a clear spatial separation in model performance. Most existing MLLMs, including GPT-5.4 and Gemini3.1-Pro, cluster in the lower-left region with both low automated scores and low human ratings. In contrast, PAGER occupies the top-right corner, achieving high automated success alongside superior human preference. The near-perfect correlation (r = 0.9397) demonstrates a strong alignment betw…

**Figure 7.** Performance comparison of PAGER against fourteen baselines on PAGE Bench.

**Figure 8.** Fine-grained performance breakdown across ten geometric capabilities.

**Figure 9.** Prompt used to screen K12 mathematics questions for GeoGebra-based geometric visualiza…

**Figure 10.** Prompt used to generate structured GeoGebra construction tasks from K12 geometry…

**Figure 11.** Prompt used for multimodal quality assurance of GeoGebra-based dataset entries. The…
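The Figure 2 excerpt defines a trajectory as alternating canvas states and actions, with each action a triple of operation type, object type, and typed parameters. A minimal Python sketch of that bookkeeping follows. The `Action` class, the tuple-based canvas, and the `step` function are hypothetical stand-ins chosen for illustration; the paper's actual environment is a live drawing application and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple

# Operation types listed in the Figure 2 excerpt.
OPERATIONS = {"click", "paint", "type"}


@dataclass(frozen=True)
class Action:
    kind: str           # kappa: operation type
    obj: str            # o: object type (labels such as "point" are assumed here)
    params: Tuple = ()  # xi: typed parameters, e.g. pixel coordinates

    def __post_init__(self):
        if self.kind not in OPERATIONS:
            raise ValueError(f"unknown operation type: {self.kind}")


# Toy canvas: an ordered tuple of (object type, params) records.
Canvas = Tuple[Tuple[str, Tuple], ...]


def step(canvas: Canvas, action: Action) -> Canvas:
    """Toy stand-in for M: C_l = M(C_{l-1}, a_l)."""
    return canvas + ((action.obj, action.params),)


def rollout(initial: Canvas, actions):
    """Build tau = (C_0, a_1, C_1, ..., a_L, C_L) as a flat list."""
    trajectory = [initial]
    canvas = initial
    for action in actions:
        canvas = step(canvas, action)
        trajectory.extend([action, canvas])
    return trajectory


plan = [
    Action("click", "point", (120, 200)),
    Action("click", "point", (320, 200)),
    Action("click", "segment", (120, 200, 320, 200)),
]
tau = rollout((), plan)
print(len(tau))  # 2 * L + 1 entries for L actions
```

In this toy version the segment simply stores the endpoint coordinates it was given, so a misplaced endpoint click would carry straight into the segment record. That is the kind of dependency the page describes, in a much reduced form.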

Original abstract

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.
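The gap described in the abstract is a gap between three different metrics, and a small sketch may make the distinction concrete. The code below assumes that a step succeeds only when both the operation type and the point are correct, and that a task succeeds only when every step succeeds. Those definitions, the `StepResult` record, and the toy data are illustrative assumptions, not the paper's evaluation protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    type_correct: bool   # predicted operation type matches the reference
    point_correct: bool  # predicted coordinates fall within tolerance


Trajectory = List[StepResult]


def action_type_accuracy(trajectories: List[Trajectory]) -> float:
    steps = [s for t in trajectories for s in t]
    return sum(s.type_correct for s in steps) / len(steps)


def step_success_rate(trajectories: List[Trajectory]) -> float:
    # Assumed definition: a step counts only if type and point are both right.
    steps = [s for t in trajectories for s in t]
    return sum(s.type_correct and s.point_correct for s in steps) / len(steps)


def task_success_rate(trajectories: List[Trajectory]) -> float:
    # Assumed definition: a task counts only if every step succeeded.
    solved = sum(
        all(s.type_correct and s.point_correct for s in t) for t in trajectories
    )
    return solved / len(trajectories)


# Made-up toy data: every action type is right, but each task has one bad point.
toy = [
    [StepResult(True, True), StepResult(True, False),
     StepResult(True, True), StepResult(True, True)],
    [StepResult(True, True), StepResult(True, True),
     StepResult(True, True), StepResult(True, False)],
]

print(action_type_accuracy(toy), step_success_rate(toy), task_success_rate(toy))
```

On this toy data action-type accuracy is 1.0 and step success is 0.75, yet task success is 0.0, which mirrors the qualitative pattern the abstract reports without reproducing any of its numbers.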

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PAGER adds a focused benchmark for point-precise geometric GUI tasks and shows clear gains with a planning-plus-RL agent, but the training feedback may not carry over to ordinary visual inference.


The paper's core contribution is PAGE Bench, a set of 4,906 geometric construction problems with over 224K labeled actions, paired with the PAGER agent that separates dependency-aware planning from pixel-level execution and adds precision-aligned RL. This directly targets the semantic-execution gap, where models pick the right action type but still fail at task completion because small coordinate errors break downstream geometry. The reported lift (4.1x task success over the strongest evaluated general baseline, and step success above 62% versus under 9% for GUI-specialized agents) looks like a genuine step forward on this specific class of problems rather than another incremental GUI wrapper. The decomposition and the use of geometric feedback during RL training are the parts that feel new relative to earlier agent work. The numbers on the gap itself are useful to see laid out plainly.

The main soft spot is the RL stage. It relies on state-conditioned geometric feedback that supplies exact point errors and topological checks during rollouts. The abstract does not spell out whether this signal is stripped away at inference or replaced by a learned visual proxy. If the 62% figure depends on that privileged oracle staying in the loop, the practical gains could shrink once the model faces only pixel observations and action outcomes. The benchmark construction itself also needs checking for how well the 4,906 problems capture real dependency chains versus curated cases.

Readers working on multimodal agents for design software or CAD interfaces will find the benchmark and the planning-execution split worth looking at. The work is coherent on its own terms and has enough concrete results to justify sending it to referees rather than a desk reject. I would recommend peer review, with the main request being a clearer statement on how the geometric feedback is handled after training.

Referee Report

2 major / 2 minor

Summary. The paper identifies a semantic-execution gap in precision-sensitive GUI tasks where actions require point-level accuracy on continuous canvases and ontological dependencies can cause cascading failures. It introduces PAGE Bench (4,906 problems, >224K pixel-level actions) and proposes PAGER, which decomposes tasks into dependency-structured planning plus pixel-grounded execution. Supervised tuning learns executable action grammar; precision-aligned RL uses state-conditioned geometric feedback to reduce exposure bias. Experiments show general VLMs exceed 88% action-type accuracy yet <6% task success, while PAGER achieves 4.1x higher task success than the strongest baseline and >62% step success, establishing a new SOTA for point-precise GUI control.

Significance. If the empirical gains hold under standard visual-only inference, the work usefully isolates a new regime of dependency-driven error propagation in GUI agents and supplies both a benchmark and a training recipe that demonstrably narrows the gap between semantic understanding and executable precision. The scale of the action dataset and the explicit contrast between region-tolerant and point-precise regimes are concrete contributions that future GUI-agent research can build upon.

major comments (2)

[Methods (precision-aligned RL)] Methods section describing precision-aligned RL: the state-conditioned geometric feedback (exact point-level errors and topological validity) is presented as the mechanism that mitigates exposure bias and yields the reported 62% step success. The manuscript does not state whether this oracle signal is removed at inference or whether an equivalent visual proxy is learned; if the 4.1x task-success gain depends on privileged geometric supervision unavailable to standard pixel-only agents, the central claim of practical SOTA does not yet transfer.
[Experiments] Experiments section and abstract claims: the 4.1x task-success and >62% step-success figures are reported without error bars, confidence intervals, or explicit verification that the 224K actions were collected under the same train/test split used for evaluation. Because the central claim rests on these aggregate numbers establishing a new state of the art, the absence of statistical detail and reproducibility artifacts is load-bearing.

minor comments (2)

[Abstract and §1] The term 'Semantic-Execution Gap' is used prominently in the abstract and title but receives only an informal gloss; a short formal definition or equation characterizing the gap (e.g., success rate conditioned on action-type accuracy) would improve clarity.
[Figures and Tables] Figure captions and table headers should explicitly note whether reported metrics are macro- or micro-averaged and whether they include only successful trajectories or all rollouts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important points on methodological clarity and statistical reporting that we will address to strengthen the manuscript. We respond to each major comment below.


Referee: [Methods (precision-aligned RL)] Methods section describing precision-aligned RL: the state-conditioned geometric feedback (exact point-level errors and topological validity) is presented as the mechanism that mitigates exposure bias and yields the reported 62% step success. The manuscript does not state whether this oracle signal is removed at inference or whether an equivalent visual proxy is learned; if the 4.1x task-success gain depends on privileged geometric supervision unavailable to standard pixel-only agents, the central claim of practical SOTA does not yet transfer.

Authors: We appreciate the referee's request for explicit clarification on this point. The state-conditioned geometric feedback (point-level errors and topological validity) is used solely as a training-time signal within the precision-aligned RL stage to provide dense rewards and mitigate exposure bias during policy rollouts. Once training concludes, the resulting policy is deployed at inference using only standard visual observations from the GUI canvas, with no access to oracle geometric information. This is consistent with standard RL practice for GUI agents, where privileged signals aid learning but are unavailable during execution. We will revise the Methods section to state this distinction explicitly and confirm that all reported inference-time results (including the 4.1x task success and >62% step success) are obtained under visual-only conditions. revision: yes
Referee: [Experiments] Experiments section and abstract claims: the 4.1x task-success and >62% step-success figures are reported without error bars, confidence intervals, or explicit verification that the 224K actions were collected under the same train/test split used for evaluation. Because the central claim rests on these aggregate numbers establishing a new state of the art, the absence of statistical detail and reproducibility artifacts is load-bearing.

Authors: We agree that additional statistical detail and reproducibility information are necessary to support the central claims. The 4.1x task-success and >62% step-success results are computed on the held-out test split of PAGE Bench; the >224K pixel-level actions were collected exclusively from the training problems under the train/test split detailed in the dataset construction section. In the revised manuscript we will add error bars (standard deviation across five independent runs with different random seeds), 95% confidence intervals for the key metrics, and an explicit statement confirming the data split and evaluation protocol. We will also include a reproducibility appendix listing hyperparameters, seeds, and evaluation code references. revision: yes
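The statistics promised in the second response (standard deviation across five seeded runs and a 95% confidence interval) can be computed as in the sketch below. It assumes a Student-t interval on the mean with four degrees of freedom, which is one common choice for five runs; the rebuttal does not say which interval the authors intend. The scores are made-up placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776


def summarize(scores):
    """Return (mean, sample std, 95% CI) for exactly five per-seed scores."""
    n = len(scores)
    if n != 5:
        raise ValueError("the hard-coded critical value assumes five runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder per-seed step success rates, invented for illustration only.
placeholder_scores = [0.50, 0.52, 0.54, 0.51, 0.53]
mean, std, (low, high) = summarize(placeholder_scores)
print(f"mean={mean:.3f} std={std:.4f} ci95=({low:.3f}, {high:.3f})")
```

With only five runs the t-based interval is noticeably wider than a normal-approximation interval would be, which is why the critical value is 2.776 rather than 1.96.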

Circularity Check

0 steps flagged

No circularity: empirical results on new benchmark with independent method components


The paper introduces PAGE Bench (4,906 problems) and PAGER agent, which applies standard pixel-grounded supervised tuning followed by RL using state-conditioned geometric feedback. Central claims consist of measured task success rates (4.1x improvement, 62% step success) on this benchmark rather than any derived quantity obtained by fitting parameters to a target and then re-predicting it, or by self-referential equations. No load-bearing step reduces to a self-citation chain, uniqueness theorem from the same authors, or ansatz smuggled via prior work; the feedback mechanism is an explicit training choice whose generalization is an empirical question separate from circularity. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method relies on standard supervised and reinforcement learning techniques applied to a new domain.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: PAGER factorizes drawing into dependency-structured planning and pixel-level execution... precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: $\mathrm{Succ}_{\mathrm{pt}}(a) = \mathbb{I}\left[\lVert p(a) - p^{*} \rVert_2 \le \epsilon\right]$ ... $\Delta C_{\ell+1} \approx J_\ell \, \Delta C_\ell + B_\ell \, \Delta \xi_\ell$
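The two expressions in the second quoted passage, a point-success indicator and a linearized error recurrence, can be illustrated with a short sketch. The code assumes Euclidean distance in pixel space and replaces the matrices J and B with scalar gains for readability. All numeric values are invented for illustration and say nothing about the paper's actual sensitivities.

```python
import math


def point_success(pred, target, eps):
    """Indicator I[ ||p(a) - p*||_2 <= eps ] for one predicted point."""
    return math.dist(pred, target) <= eps


def propagate(initial_dev, state_gains, param_gains, param_errors):
    """Iterate dC_{l+1} ~ J_l * dC_l + B_l * dxi_l with scalar stand-ins.

    state_gains plays the role of J_l, param_gains the role of B_l, and
    param_errors the role of dxi_l. Returns the deviation after each step.
    """
    dev = initial_dev
    history = [dev]
    for j, b, dxi in zip(state_gains, param_gains, param_errors):
        dev = j * dev + b * dxi
        history.append(dev)
    return history


# A click 5 px from its target passes a 5 px tolerance and fails a 4 px one.
print(point_success((100, 100), (103, 104), eps=5))
print(point_success((100, 100), (103, 104), eps=4))

# One 2 px parameter error at the first step, then error-free actions.
# With an assumed gain above 1, the deviation keeps growing anyway.
print(propagate(0.0, [1.5] * 3, [1.0] * 3, [2.0, 0.0, 0.0]))
```

The second call shows the mechanism in miniature: when the state gain exceeds 1, a single early coordinate error grows even though every later action is exact.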

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

99 extracted references · 99 canonical work pages · 13 internal anchors

[1]

Claude 4.6 sonnet: Performance and safety updates

Anthropic. Claude 4.6 sonnet: Performance and safety updates. Technical report, Anthropic PBC,

work page
[2]

Available at: https://www.anthropic.com/news/ claude-sonnet-4-6

Official announcement and technical overview. Available at: https://www.anthropic.com/news/ claude-sonnet-4-6

work page
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward.arXiv preprint arXiv:2601.05073, 2026

Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, and Renqiu Xia. Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward. arXiv preprint arXiv:2601.05073, 2026

work page arXiv 2026
[5]

Guicourse: From general vision language model to versatile gui agent

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language model to versatile gui agent. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21936–21959, 2025

work page 2025
[6]

Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision.arXiv preprint arXiv:2505.13427, 2025

Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, and Wenqi Shao. Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision.arXiv preprint arXiv:2505.13427, 2025

work page arXiv 2025
[7]

Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark.arXiv preprint arXiv:2508.07575, 2025

Shiqing Fan, Xichen Ding, Liang Zhang, and Linjian Mo. Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark.arXiv preprint arXiv:2508.07575, 2025

work page arXiv 2025
[8]

Geobench: Rethinking multimodal geometric problem-solving via hierarchical evaluation

Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, and Junchi Yan. Geobench: Rethinking multimodal geometric problem-solving via hierarchical evaluation. InThe Fourteenth International Conference on Learning Representations, 2026. ICLR 2026 Poster

work page 2026
[9]

Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2026

Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, Botian Shi, Bo Zhang, and Yu Qiao. Trustgeogen: Scalable and formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2025

work page arXiv 2025
[10]

Gemini 3.1 pro: Model card and technical overview

Google DeepMind. Gemini 3.1 pro: Model card and technical overview. Technical report, Google, February 2026. URL https://deepmind.google/models/gemini/pro/. Official technical documentation describing the Gemini 3.1 Pro preview model. Available at:https://deepmind.google/models/model-cards/ gemini-3-1-pro/

work page 2026
[11]

Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020, 2025

work page arXiv 2025
[12]

Zhitao He, Zongwei Lyu, Dazhong Chen, Dadi Guo, and Yi R. Fung. Matp-bench: Can mllm be a good automated theorem prover for multimodal problems?arXiv preprint arXiv:2506.06034, 2025

work page arXiv 2025
[13]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14281–14290, 2024

work page 2024
[14]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Reguide: Data efficient gui grounding via spatial reasoning and search

Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Cheonbok Park, Sookyo In, Chansong Jo, Jaehong Lee, Jinwoo Shin, and Kang Min Yoo. Reguide: Data efficient gui grounding via spatial reasoning and search. arXiv preprint arXiv:2505.15259, 2025

work page arXiv 2025
[16]

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, and Hui Li. Ui-agile: Advancing gui agents with effective reinforcement learning and precise inference-time grounding.arXiv preprint arXiv:2507.22025, 2025. Accepted to CVPR 2026 Findings

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Beyond clicking: A step towards generalist gui grounding via text dragging.arXiv preprint arXiv:2601.06031, 2026

Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, and Ahmed Awadallah. Beyond clicking: A step towards generalist gui grounding via text dragging.arXiv preprint arXiv:2601.06031, 2026. 12 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

work page arXiv 2026
[18]

Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/ 2024-01-30-llava-next/

work page 2024
[19]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection

Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 103...

work page 2026
[21]

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning.arXiv preprint arXiv:2503.21620, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Visaidmath: Benchmark- ing visual-aided mathematical reasoning

Jingkun Ma, Runzhe Zhan, Derek F. Wong, Yang Li, Di Sun, Hou Pong Chan, and Lidia S. Chao. Visaidmath: Benchmarking visual-aided mathematical reasoning.arXiv preprint arXiv:2410.22995, 2024

work page arXiv 2024
[23]

Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation

Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9097–9110, 2024

work page 2024
[24]

Gui-360: A comprehensive dataset and benchmark for computer-using agents.arXiv preprint arXiv:2511.04307, 2025

Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Gui-360: A comprehensive dataset and benchmark for computer-using agents.arXiv preprint arXiv:2511.04307, 2025

work page arXiv 2025
[25]

Gui agents: A survey

Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538, 2025

work page 2025
[26]

Introducing gpt-5.4

OpenAI. Introducing gpt-5.4. OpenAI Blog, 2026. Official announcement describing GPT-5.4 capabilities and model updates. Available at:https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/

work page 2026
[28]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/ blog?id=qwen3.6-35b-a3b

work page 2026
[30]

Medspot: A workflow-aware sequential grounding benchmark for clinical gui.arXiv preprint arXiv:2603.19993, 2026

Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, and Tajamul Ashraf. Medspot: A workflow-aware sequential grounding benchmark for clinical gui.arXiv preprint arXiv:2603.19993, 2026

work page arXiv 2026
[31]

Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720, 2025

Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720, 2025

work page arXiv 2025
[32]

Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Solidgeo: Measuring multimodal spatial math reasoning in solid geometry

Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, and Cheng-Lin Liu. Solidgeo: Measuring multimodal spatial math reasoning in solid geometry. InAdvances in Neural Information Processing Systems, 2025. NeurIPS 2025 Datasets and Benchmarks Track poster. 13 PAGER: Bridging the Semantic-Execution Gap in Point-Precise...

work page 2025
[34]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration.Advances in Neural Information Processing Systems, 37:2686–2710, 2024a

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

work page arXiv 2025
[35]

History-aware reasoning for gui agents

Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026

work page 2026
[36]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions

Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, and Cheng Tan. Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions. arXiv preprint arXiv:2508.03173, 2025

work page arXiv 2025
[38]

GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, and Han- meng Liu. Geosketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

work page arXiv 2025
[39]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302, 2024.
[41]

Xiaobo Xia and Run Luo. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458, 2025.
[42]

Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, and Bo Zheng. Geosense: Evaluating identification and application of geometric principles in multimodal reasoning. arXiv preprint arXiv:2504.12597, 2025.
[43]

Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Jiayu Huang, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Online agentic reinforcement learning for mobile gui agents. In The Fourteenth International Conference on Learning Representations, 2026. ICLR 2026 Poster.
[44]

Leyang Yang, Ziwei Wang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. Probench: Benchmarking gui agents with accurate process information. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 27547–27555, 2026.
[45]

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, and Tong Zhang. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl. arXiv preprint arXiv:2602.22190, 2026.
[46]

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.
[47]

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, and Bo Li. Se-gui: Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 poster.
[48]

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1859–1869, October 2025.
[49]

Yuhao Zhang, Dingxin Hu, Tinghao Yu, Hao Liu, and Yiting Liu. Geofm: Enhancing geometric reasoning of mllms via synthetic data generation through formal language. arXiv preprint arXiv:2510.27448, 2025.
[50]

Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, and Jun Liu. Geochallenge: A multi-answer multiple-choice benchmark for geometric reasoning with diagrams. arXiv preprint arXiv:2603.19252, 2026.
[51]

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point. arXiv preprint arXiv:2502.08047, 2025.
[52]

Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. In Advances in Neural Information Processing Systems,
[53]

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
[54]

Core Task: Screen K12 mathematics questions to identify geometry-related ones that can use GeoGebra software for graphing to assist problem-solving/teaching. Exclude questions that do not require or cannot be visualized with GeoGebra, such as pure algebraic calculations or pure logical reasoning.
[55]

Input Format: Batch text of K12 mathematics questions, which may include stems, options, and problem-solving steps. Separate questions with blank lines or serial numbers.
[56]

Input Scope: K12 mathematics questions, including primary, middle, and high school. Specify the school stage if available; otherwise, identify it automatically.
[57]

Screening Criteria (Applicable if Any Is Met)

[58]

Core involves geometry graph drawing/analysis: triangles, quadrilaterals, circles, polygons, 3D shapes, coordinate system graphs, etc.

[59]

Requires graph measurement/construction: side lengths, angles, areas, volumes, perpendicular/parallel lines, angle bisectors, circumscribed/inscribed circles, and transformations.

[60]

Involves geometry relationship verification: congruence, similarity, parallelism, perpendicularity, collinearity, or concyclicity requiring visualization.

[61]

Coordinate-system related: point coordinates, linear/circle equations, conic sections, and graph-based property analysis.

[62]

Exclusion Criteria (Not Applicable if Any Is Met)

[63]

Pure algebraic calculations: solving equations, factorization, formula calculation, sequences, or non-geometric probability/statistics.

[64]

Pure logical reasoning: text-only geometry theorem proofs or definition discrimination.

[65]

No clear geometric elements: numerical calculations only or application questions with no graph correlation.
[66]

Output Format (JSON Structure): Output a JSON array, where each element represents a screened question. Each element contains the following fields:
• school_stage: String. School stage of the question: “Primary”, “Middle”, or “High”.
• can_draw: Boolean. true means applicable to GeoGebra; false means not applicable.
• ggb_content: String. If can_draw=true, provid...
[67]

JSON Output Example
[ { "school_stage": "Middle", "can_draw": true, "ggb_content": "Right triangle ABC with right angle at C. Mark side AB=5cm. Label vertices A, B, C and right angle symbol. And then calculate BC.", "supplementary_note": "Use GeoGebra's Right Triangle tool for quick construction." }, { "school_stage": "Primary", "can_draw": false, "ggb_co...
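
The output format described above lends itself to a mechanical check before screened items are passed downstream. The following is a minimal sketch in Python, not the authors' tooling: the function name `validate_screening_output` is hypothetical, and only the three fields whose types are stated in the prompt (`school_stage`, `can_draw`, `ggb_content`) are checked, because the remainder of the field list is truncated in the source.

```python
import json

# Stage values quoted in the prompt's field description.
ALLOWED_STAGES = {"Primary", "Middle", "High"}


def validate_screening_output(raw: str) -> list[str]:
    """Return a list of problems found; an empty list means the output passed.

    Hypothetical helper: it checks only the fields whose types are stated
    in the prompt text and ignores any additional keys.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, list):
        return ["top-level value must be a JSON array"]

    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: must be an object")
            continue
        stage = item.get("school_stage")
        if not isinstance(stage, str) or stage not in ALLOWED_STAGES:
            problems.append(f"item {i}: school_stage must be one of {sorted(ALLOWED_STAGES)}")
        if not isinstance(item.get("can_draw"), bool):
            problems.append(f"item {i}: can_draw must be a boolean")
        if not isinstance(item.get("ggb_content"), str):
            problems.append(f"item {i}: ggb_content must be a string")
    return problems
```

Returning a list of messages rather than raising on the first error makes it easier to log every defect in a batch of model outputs at once.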
[68]

Mission Statement: You will receive a dataset entry containing three core components: Question, Answer, and Image. Your primary focus is to integrate all three components to infer accurate construction steps. You must:
[69]

Analyze: Analyze the construction logic by combining the problem requirements, answer clues, and image context.

[70]

Infer: If coordinates are not explicitly provided, infer reasonable Cartesian coordinates that satisfy the described geometric relationships.
[71]

Map: Map each construction step to the exact function in the Allowed Function Library. Custom functions are not permitted.
4. Classify: Classify the problem by selecting skills, grade level, and drawing difficulty.
5. Labeling rule: Do not add labels beyond what is required by the Question, Answer, or Image.
6. Output: Strictly follow the Mandatory Output Fo...
[72]

Allowed Function Library: You may only use the functions defined below.
A. General and Input
• generate_input_action: Used for algebraic text input. Parameters: {"text": "string"}.
• add_text_label: Add a text label at a specified position. Parameters: {"position": [x, y], "text": "string"}.
B. Points
• draw_point: Draw one or more points. Parameters:{"points...
[73]

Object Grounding and Dependency Rules

[74]

No floating reference points: Any reference point used in a construction function must have been previously created or must be strictly located on a previously created object.

[75]

Create before use: If a step requires construction based on an existing object, that object must have been created in an earlier task.
[76]

Strict on-object point constraints: Coordinates of an on-object point must exactly satisfy the geometric equation of that object.
4. Prefer reuse of existing key points: Reuse endpoints, vertices, or centers whenever possible.
[77]

Implicit object registry: Maintain an internal list of created objects and ensure all later steps reference them consistently.
Construction Function Requirements
• perpendicular_line and parallel_line: the first point must lie on an existing line, segment, or ray; the second point must be an existing point.
• perpendicular_bisector: both points must be exis...
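
The "create before use" and "implicit object registry" rules above amount to a single pass over the construction steps with a running set of known object names. The sketch below illustrates that idea under stated assumptions: the step layout with `creates` and `uses` keys and the function name `find_dependency_violations` are hypothetical, since the parameter schemas of the Allowed Function Library are truncated in the source and are not reproduced here.

```python
def find_dependency_violations(steps):
    """Return (step_index, missing_name) pairs for objects used before creation.

    Assumed, hypothetical step layout:
        {"function": "<library function name>",
         "uses": ["names of existing objects"],
         "creates": ["names of new objects"]}
    """
    registry = set()      # names of every object created by earlier steps
    violations = []
    for index, step in enumerate(steps):
        # Check references first, so a step cannot depend on its own output.
        for name in step.get("uses", []):
            if name not in registry:
                violations.append((index, name))
        registry.update(step.get("creates", []))
    return violations


# Usage: the bisector step refers to points A and B created one step earlier.
steps = [
    {"function": "draw_point", "creates": ["A", "B"]},
    {"function": "perpendicular_bisector", "uses": ["A", "B"], "creates": ["m"]},
]
print(find_dependency_violations(steps))
```

Checking `uses` before registering `creates` matches the wording that an object must have been created in an earlier task, not in the same one.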
[78]

Coordinate Inference Protocol
• If coordinates are not provided, infer reasonable Cartesian coordinates that preserve geometric properties.
• Integer Priority Rule: If special coordinates are not required, choose integer coordinates whenever possible.
• Use decimals only when integers cannot satisfy the required geometric constraints.
• If the figure impl...
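
The requirement that an on-object point must exactly satisfy the object's equation, together with the Integer Priority Rule, can be tested without floating-point tolerance by using rational arithmetic. The following sketch uses Python's standard `fractions` module; the helper names `on_line`, `on_circle`, and `is_integer_point` are hypothetical, and reading each coordinate through its decimal string form is an assumption about how inferred coordinates are written down.

```python
from fractions import Fraction


def to_exact(value):
    """Convert an int, or a decimal such as 2.5, to an exact Fraction via its text form."""
    return Fraction(str(value))


def on_line(point, a, b):
    """True if `point` lies exactly on the line through distinct points a and b."""
    px, py = map(to_exact, point)
    ax, ay = map(to_exact, a)
    bx, by = map(to_exact, b)
    # A zero cross product means the three points are collinear.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax) == 0


def on_circle(point, center, radius):
    """True if `point` satisfies (x - cx)^2 + (y - cy)^2 = r^2 exactly."""
    px, py = map(to_exact, point)
    cx, cy = map(to_exact, center)
    return (px - cx) ** 2 + (py - cy) ** 2 == to_exact(radius) ** 2


def is_integer_point(point):
    """True if both coordinates are integers, as the Integer Priority Rule prefers."""
    return all(to_exact(c).denominator == 1 for c in point)
```

For example, `on_circle((0.6, 0.8), (0, 0), 1)` holds exactly under this representation, whereas a direct floating-point comparison of the squared terms may not.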
[79]

Classification Standards
A. Skill Taxonomy (multiple selections are allowed)
[80]

Basic geometric object construction

[81]

Numerical and metric constraints


[1]

Anthropic. Claude 4.6 sonnet: Performance and safety updates. Technical report, Anthropic PBC,

[2]

Official announcement and technical overview. Available at: https://www.anthropic.com/news/claude-sonnet-4-6

[3]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[4]

Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, and Renqiu Xia. Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward. arXiv preprint arXiv:2601.05073, 2026.

[5]

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language model to versatile gui agent. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21936–21959, 2025.

[6]

Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, and Wenqi Shao. Mm-prm: Enhancing multimodal mathematical reasoning with scalable step-level supervision. arXiv preprint arXiv:2505.13427, 2025.

[7]

Shiqing Fan, Xichen Ding, Liang Zhang, and Linjian Mo. Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark. arXiv preprint arXiv:2508.07575, 2025.

[8]

Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, and Junchi Yan. Geobench: Rethinking multimodal geometric problem-solving via hierarchical evaluation. In The Fourteenth International Conference on Learning Representations, 2026. ICLR 2026 Poster.

[9]

Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, Botian Shi, Bo Zhang, and Yu Qiao. Trustgeogen: Scalable and formal-verified data engine for trustworthy multi-modal geometric problem solving. arXiv preprint arXiv:2504.15780, 2025.

[10]

Google DeepMind. Gemini 3.1 pro: Model card and technical overview. Technical report, Google, February 2026. URL https://deepmind.google/models/gemini/pro/. Official technical documentation describing the Gemini 3.1 Pro preview model. Available at: https://deepmind.google/models/model-cards/gemini-3-1-pro/

[11]

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation. arXiv preprint arXiv:2510.11020, 2025.

[12]

Zhitao He, Zongwei Lyu, Dazhong Chen, Dadi Guo, and Yi R. Fung. Matp-bench: Can mllm be a good automated theorem prover for multimodal problems? arXiv preprint arXiv:2506.06034, 2025.

[13]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14281–14290, 2024.

[14]

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

[15]

Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Cheonbok Park, Sookyo In, Chansong Jo, Jaehong Lee, Jinwoo Shin, and Kang Min Yoo. Reguide: Data efficient gui grounding via spatial reasoning and search. arXiv preprint arXiv:2505.15259, 2025.

[16]

Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, and Hui Li. Ui-agile: Advancing gui agents with effective reinforcement learning and precise inference-time grounding. arXiv preprint arXiv:2507.22025, 2025. Accepted to CVPR 2026 Findings.

[17]

Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, and Ahmed Awadallah. Beyond clicking: A step towards generalist gui grounding via text dragging. arXiv preprint arXiv:2601.06031, 2026.

[18]

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

[19]

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025.

[20]

Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 103...

[21]

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025.

[22]

Jingkun Ma, Runzhe Zhan, Derek F. Wong, Yang Li, Di Sun, Hou Pong Chan, and Lidia S. Chao. Visaidmath: Benchmarking visual-aided mathematical reasoning. arXiv preprint arXiv:2410.22995, 2024.

[23]

Xinbei Ma, Zhuosheng Zhang, and Hai Zhao. Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9097–9110, 2024.

[24]

Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Gui-360: A comprehensive dataset and benchmark for computer-using agents. arXiv preprint arXiv:2511.04307, 2025.

[25]

Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538, 2025.

[26]

OpenAI. Introducing gpt-5.4. OpenAI Blog, 2026. Official announcement describing GPT-5.4 capabilities and model updates. Available at: https://openai.com/zh-Hans-CN/index/introducing-gpt-5-4/

[28]

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.

[29]

Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b

[30]

Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, and Tajamul Ashraf. Medspot: A workflow-aware sequential grounding benchmark for clinical gui. arXiv preprint arXiv:2603.19993, 2026.

[31]

Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025.

[32]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025.

[33]

Peijie Wang, Chao Yang, Zhong-Zhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhilong Ji, Jinfeng Bai, and Cheng-Lin Liu. Solidgeo: Measuring multimodal spatial math reasoning in solid geometry. In Advances in Neural Information Processing Systems, 2025. NeurIPS 2025 Datasets and Benchmarks Track poster.

[34]

Mobile-agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems, 37:2686–2710, 2024a

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123, 2025

work page arXiv 2025

[34] [35]

History-aware reasoning for gui agents

Ziwei Wang, Leyang Yang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. History-aware reasoning for gui agents. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36448–36456, 2026

work page 2026

[35] [36]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [37]

Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions

Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, and Cheng Tan. Geoint-r1: Formalizing multimodal geometric reasoning with dynamic auxiliary constructions. arXiv preprint arXiv:2508.03173, 2025

work page arXiv 2025

[37] [38]

GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, and Han- meng Liu. Geosketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

work page arXiv 2025

[38] [39]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [40]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [41]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Xiaobo Xia and Run Luo. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [42]

GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning.arXiv preprint arXiv:2504.12597, 2025

Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, and Bo Zheng. Geosense: Evaluating identification and application of geometric principles in multimodal reasoning.arXiv preprint arXiv:2504.12597, 2025

work page arXiv 2025

[42] [43]

Mobilerl: Online agentic reinforcement learning for mobile gui agents

Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Jiayu Huang, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Online agentic reinforcement learning for mobile gui agents. InThe Fourteenth International Conference on Learning Representations, 2026. ICLR 2026 Poster

work page 2026

[43] [44]

Probench: Benchmarking gui agents with accurate process information

Leyang Yang, Ziwei Wang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, and Yong Li. Probench: Benchmarking gui agents with accurate process information. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 27547–27555, 2026

work page 2026

[44] [45]

Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, and Tong Zhang. Gui-libra: Training native gui agents to reason and act with action-aware supervision and partially verifiable rl.arXiv preprint arXiv:2602.22190, 2026

work page arXiv 2026

[45] [46]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [47]

Se-gui: Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, and Bo Li. Se-gui: Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. InAdvances in Neural Information Processing Systems, 2025. NeurIPS 2025 poster

work page 2025

[47] [48]

R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1859–1869, October 2025

work page 2025

[48] [49]

Geofm: Enhancing geometric reasoning of mllms via synthetic data generation through formal language, 2025

Yuhao Zhang, Dingxin Hu, Tinghao Yu, Hao Liu, and Yiting Liu. Geofm: Enhancing geometric reasoning of mllms via synthetic data generation through formal language.arXiv preprint arXiv:2510.27448, 2025. 14 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

work page arXiv 2025

[49] [50]

Geochallenge: A multi-answer multiple-choice benchmark for geometric reasoning with diagrams.arXiv preprint arXiv:2603.19252, 2026

Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, and Jun Liu. Geochallenge: A multi-answer multiple-choice benchmark for geometric reasoning with diagrams.arXiv preprint arXiv:2603.19252, 2026

work page arXiv 2026

[50] [51]

Worldgui: An interactive benchmark for desktop gui automation from any starting point.arXiv preprint arXiv:2502.08047, 2025

Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou. Worldgui: An interactive benchmark for desktop gui automation from any starting point.arXiv preprint arXiv:2502.08047, 2025

work page arXiv 2025

[51] [52]

Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents

Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. InAdvances in Neural Information Processing Systems,

work page

[52] [53]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 15 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control A Performance ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [54]

Exclude questions that do not require or cannot be visualized with GeoGebra, such as pure algebraic calculations or pure logical reasoning

Core Task Screen K12 mathematics questions to identifygeometry-related ones that can use GeoGebra software for graphing to assist problem-solving/teaching. Exclude questions that do not require or cannot be visualized with GeoGebra, such as pure algebraic calculations or pure logical reasoning

work page

[54] [55]

Separate questions with blank lines or serial numbers

Input Format: Batch text of K12 mathematics questions, which may include stems, options, and problem-solving steps. Separate questions with blank lines or serial numbers

Input Scope: K12 mathematics questions, including primary, middle, and high school. Specify the school stage if available; otherwise, identify it automatically

Screening Criteria (Applicable if Any Is Met)

Core involves geometry graph drawing/analysis: triangles, quadrilaterals, circles, polygons, 3D shapes, coordinate system graphs, etc.

Requires graph measurement/construction: side lengths, angles, areas, volumes, perpendicular/parallel lines, angle bisectors, circumscribed/inscribed circles, and transformations.

Involves geometry relationship verification: congruence, similarity, parallelism, perpendicularity, collinearity, or concyclicity requiring visualization.

Coordinate-system related: point coordinates, linear/circle equations, conic sections, and graph-based property analysis.

Exclusion Criteria (Not Applicable if Any Is Met)

Pure algebraic calculations: solving equations, factorization, formula calculation, sequences, or non-geometric probability/statistics

Pure logical reasoning: text-only geometry theorem proofs or definition discrimination

No clear geometric elements: numerical calculations only or application questions with no graph correlation

Output Format (JSON Structure): Output a JSON array, where each element represents a screened question. Each element contains the following fields:
• school_stage: String. School stage of the question: “Primary”, “Middle”, or “High”.
• can_draw: Boolean. true means applicable to GeoGebra; false means not applicable.
• ggb_content: String. If can_draw=true, provid...

JSON Output Example:
[
  {
    "school_stage": "Middle",
    "can_draw": true,
    "ggb_content": "Right triangle ABC with right angle at C. Mark side AB=5cm. Label vertices A, B, C and right angle symbol. And then calculate BC.",
    "supplementary_note": "Use GeoGebra's Right Triangle tool for quick construction."
  },
  {
    "school_stage": "Primary",
    "can_draw": false,
    "ggb_co...
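A screening output in this shape can be checked mechanically before it is passed downstream. The following is a minimal sketch, not the authors' tooling: the field names come from the example, while the helper names `validate_entry` and `validate_output`, and the assumption that all four fields are required on every entry, are hypothetical.

```python
import json
from typing import Dict, List

# Field names are taken from the JSON example; treating all four as
# required is an assumption made for this sketch.
REQUIRED_FIELDS = {
    "school_stage": str,
    "can_draw": bool,
    "ggb_content": str,
    "supplementary_note": str,
}
ALLOWED_STAGES = {"Primary", "Middle", "High"}


def validate_entry(entry: dict) -> List[str]:
    """Return a list of problems for one entry; an empty list means it passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    stage = entry.get("school_stage")
    if isinstance(stage, str) and stage not in ALLOWED_STAGES:
        problems.append(f"unknown school_stage: {stage}")
    return problems


def validate_output(raw: str) -> Dict[int, List[str]]:
    """Parse the raw output and map entry index to problems (failing entries only)."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("top-level JSON value must be an array")
    report = {}
    for i, entry in enumerate(data):
        if isinstance(entry, dict):
            issues = validate_entry(entry)
        else:
            issues = ["entry is not an object"]
        if issues:
            report[i] = issues
    return report
```

An empty report means every entry has the expected fields and types; any other result lists the failing entries by index.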

Mission Statement: You will receive a dataset entry containing three core components: Question, Answer, and Image. Your primary focus is to integrate all three components to infer accurate construction steps. You must:

1. Analyze: Analyze the construction logic by combining the problem requirements, answer clues, and image context.
2. Infer: If coordinates are not explicitly provided, infer reasonable Cartesian coordinates that satisfy the described geometric relationships.
3. Map: Map each construction step to the exact function in the Allowed Function Library. Custom functions are not permitted.
4. Classify: Classify the problem by selecting skills, grade level, and drawing difficulty.
5. Labeling rule: Do not add labels beyond what is required by the Question, Answer, or Image.
6. Output: Strictly follow the Mandatory Output Fo...

Allowed Function Library: You may only use the functions defined below.
A. General and Input
• generate_input_action: Used for algebraic text input. Parameters: {"text": "string"}.
• add_text_label: Add a text label at a specified position. Parameters: {"position": [x, y], "text": "string"}.
B. Points
• draw_point: Draw one or more points. Parameters: {"points...
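The rule that custom functions are not permitted can be enforced with a whitelist check over each construction step. The sketch below is illustrative only. It covers just the two functions whose parameter lists are fully visible in the library text, and the step layout (`"function"` and `"parameters"` keys) and the name `check_step` are assumptions, not part of the paper.

```python
from typing import Optional

# Only the two fully specified signatures are listed here; the other
# library entries are truncated in the source and deliberately left out.
FUNCTION_LIBRARY = {
    "generate_input_action": {"text"},
    "add_text_label": {"position", "text"},
}


def check_step(step: dict) -> Optional[str]:
    """Return an error message for an invalid step, or None if it is acceptable."""
    name = step.get("function")
    if name not in FUNCTION_LIBRARY:
        return f"function not in library: {name}"
    params = step.get("parameters", {})
    expected = FUNCTION_LIBRARY[name]
    if set(params) != expected:
        return f"{name} expects parameters {sorted(expected)}, got {sorted(params)}"
    if not isinstance(params["text"], str):
        return f"{name}: 'text' must be a string"
    if "position" in params:
        pos = params["position"]
        is_pair = isinstance(pos, (list, tuple)) and len(pos) == 2
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if not (is_pair and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in pos
        )):
            return f"{name}: 'position' must be [x, y]"
    return None
```

Rejecting unknown names outright, rather than guessing the nearest library function, keeps the check aligned with the prompt's requirement of an exact mapping.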

Object Grounding and Dependency Rules

1. No floating reference points: Any reference point used in a construction function must have been previously created or must be strictly located on a previously created object.
2. Create before use: If a step requires construction based on an existing object, that object must have been created in an earlier task.
3. Strict on-object point constraints: Coordinates of an on-object point must exactly satisfy the geometric equation of that object.
4. Prefer reuse of existing key points: Reuse endpoints, vertices, or centers whenever possible.
5. Implicit object registry: Maintain an internal list of created objects and ensure all later steps reference them consistently.

Construction Function Requirements
• perpendicular_line and parallel_line: the first point must lie on an existing line, segment, or ray; the second point must be an existing point.
• perpendicular_bisector: both points must be exis...
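The grounding rules amount to a small amount of bookkeeping: remember what has been created, refuse references to anything that has not, and test whether a candidate point lies on an existing object. The following sketch shows one way this could look. The class `ObjectRegistry` and its methods are hypothetical, lines are treated as infinite lines through two points, and a floating-point tolerance stands in for the prompt's "exactly satisfy" requirement.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


class ObjectRegistry:
    """Hypothetical registry of created points, lines, and circles."""

    def __init__(self, tol: float = 1e-9) -> None:
        self.points: Dict[str, Point] = {}
        self.lines: Dict[str, Tuple[Point, Point]] = {}
        self.circles: Dict[str, Tuple[Point, float]] = {}
        self.tol = tol  # float tolerance standing in for "exactly satisfies"

    def add_point(self, name: str, xy: Point) -> None:
        self.points[name] = (float(xy[0]), float(xy[1]))

    def _require(self, name: str) -> Point:
        # Create before use: dependent objects may only reference existing points.
        if name not in self.points:
            raise KeyError(f"point {name!r} used before creation")
        return self.points[name]

    def add_line(self, name: str, p: str, q: str) -> None:
        self.lines[name] = (self._require(p), self._require(q))

    def add_circle(self, name: str, center: str, radius: float) -> None:
        self.circles[name] = (self._require(center), float(radius))

    def is_grounded(self, xy: Point) -> bool:
        """True if xy is an existing point or lies on an existing line or circle."""
        x, y = xy
        if any(math.dist(xy, p) <= self.tol for p in self.points.values()):
            return True
        for (x1, y1), (x2, y2) in self.lines.values():
            # A zero cross product means xy is collinear with the defining points.
            if abs((x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)) <= self.tol:
                return True
        for (cx, cy), r in self.circles.values():
            if abs(math.hypot(x - cx, y - cy) - r) <= self.tol:
                return True
        return False
```

A point that fails `is_grounded` is a floating reference point in the sense of rule 1 and would need to be created explicitly before being used.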

Coordinate Inference Protocol
• If coordinates are not provided, infer reasonable Cartesian coordinates that preserve geometric properties.
• Integer Priority Rule: If special coordinates are not required, choose integer coordinates whenever possible.
• Use decimals only when integers cannot satisfy the required geometric constraints.
• If the figure impl...
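As a concrete instance of the Integer Priority Rule, consider placing a right triangle with a given hypotenuse: integer legs are tried first, and decimals are used only when no integer pair fits. This is a minimal sketch under stated assumptions. The function name `right_triangle_vertices`, the placement of the right angle at the origin, and the isosceles fallback are illustrative choices, not taken from the paper.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def right_triangle_vertices(hyp: float) -> Tuple[Point, Point, Point]:
    """Return (C, A, B) with the right angle at C = (0, 0) and |AB| == hyp.

    Integer legs are preferred; decimals are used only when no integer
    pair (a, b) satisfies a^2 + b^2 == hyp^2.
    """
    if float(hyp).is_integer():
        h = int(hyp)
        for a in range(1, h):
            b_squared = h * h - a * a
            b = math.isqrt(b_squared)
            if b > 0 and b * b == b_squared:
                return (0, 0), (a, 0), (0, b)
    # Fallback (an assumption of this sketch): an isosceles right triangle.
    leg = hyp / math.sqrt(2)
    return (0.0, 0.0), (leg, 0.0), (0.0, leg)
```

For a hypotenuse of 5 the search finds the legs 3 and 4, so all three vertices land on integer grid points; for a hypotenuse of 2 no integer pair exists and the decimal fallback is used.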

Classification Standards
A. Skill Taxonomy: Multiple selections are allowed.

Basic geometric object construction

Numerical and metric constraints
