DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

arxiv: 2605.15542 · v1 · pith:26GH5WORnew · submitted 2026-05-15 · 💻 cs.AI

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

Yichao Liu , Huawen Shen , Liu Yu , Shiyu Liu , Zeyu Chen , Yu Zhou This is my paper

Pith reviewed 2026-05-19 14:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords GUI groundingtraining-free methodsmultimodal large language modelsdynamic region searchMonte Carlo Tree SearchUI Perceptorscreen understanding

0 comments p. Extension

pith:26GH5WOR Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{26GH5WOR}

Prints a linked pith:26GH5WOR badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A training-free dynamic region search method improves GUI grounding performance by 14 percent in existing multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models often struggle to ground user instructions accurately on high-resolution GUI screenshots filled with irrelevant elements. The paper introduces DRS-GUI as a way to add dynamic exploration that mimics how humans narrow their focus on complex screens. A lightweight UI Perceptor applies three actions called Focus, Shift, and Scatter to generate region proposals step by step. An MCTS-based Action Planner coordinates these actions and uses a quality reward to pick the most relevant region while discarding clutter. This module plugs into current models without any training and raises accuracy on grounding benchmarks.

Core claim

The paper claims that a training-free dynamic region search framework can improve instruction grounding on cluttered high-resolution GUI screenshots. It does so by equipping multimodal models with a lightweight UI Perceptor that executes Focus, Shift, and Scatter actions, scheduled dynamically by an MCTS-based Action Planner and evaluated through a region quality reward that selects highly relevant proposals and prunes irrelevant UI elements. This integration requires no model training or fine-tuning and produces measurable gains for both general and GUI-specific MLLMs.

What carries the argument

The central mechanism is the MCTS-based Action Planner that schedules the lightweight UI Perceptor's three perceptual actions (Focus, Shift, and Scatter) to generate, evaluate, and select instruction-relevant region proposals.

If this is right

Existing MLLMs gain improved grounding accuracy on complex GUI interfaces by adding this module.
No additional training or fine-tuning is needed to obtain the reported performance lift.
The approach applies across both general-purpose and GUI-specialized multimodal models.
Irrelevant UI components are removed efficiently through reward-driven selection of regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar search-based planning could help with other visual grounding problems where screens or scenes contain high visual clutter.
The work points to planning algorithms as one route to strengthen perception inside large models that lack explicit exploration.
Because the method adds no training cost, it could be tested quickly on additional benchmarks or real-world GUI agent tasks.

Load-bearing premise

The three perceptual actions performed by the UI Perceptor will reliably produce and allow selection of instruction-relevant regions when scheduled by the MCTS planner.

What would settle it

Direct evaluation on ScreenSpot-Pro showing that adding DRS-GUI produces no accuracy gain or a loss compared to the same MLLMs without the module would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.15542 by Huawen Shen, Liu Yu, Shiyu Liu, Yichao Liu, Yu Zhou, Zeyu Chen.

**Figure 1.** Figure 1: (a) Single-stage methods progressively crop and zoom through forward-only focus, causing error accumulation. (b) DRS-GUI uses an action planner and a UI Perceptor to explore regions via three perceptual actions, guided by region quality evaluation to select the instruction-relevant region. visual clutter, but also from the absence of an explicit mechanism for adapting perceptual scope during interaction.… view at source ↗

**Figure 2.** Figure 2: Processing pipeline of DRS-GUI. The UI Perceptor parses UI elements and scores their relevance to the instruction, while the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Redundancy reduction of our dynamic region search [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of the MCTS-based perceptual planning process. The tree illustrates how the planner dynamically adjusts perceptual [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative examples of perceptual actions. The left red [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 8.** Figure 8: Accuracy under different perception iteration counts [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 6.** Figure 6: Comparison between the responses of Qwen2.5-VL and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison between the responses of Qwen2.5-VL and [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed to evaluate and select the highly instruction-relevant region, efficiently pruning redundant UI elements. Experiments demonstrate that DRS-GUI yields a 14\% improvement on ScreenSpot-Pro for general and GUI-specific MLLMs (Qwen2.5-VL-7B and UGround-V1-7B), significantly enhancing grounding performance and generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DRS-GUI claims a 14% grounding boost on ScreenSpot-Pro via training-free MCTS-scheduled perceptual actions, but the abstract leaves it unclear whether the lift comes from the search policy or just extra model calls.

read the letter

The main point is that this paper offers a training-free framework to improve how MLLMs ground instructions on cluttered GUI screenshots by using a lightweight UI Perceptor that runs Focus, Shift, and Scatter actions, then lets MCTS pick the best region proposal based on a quality reward. It reports a 14% gain for Qwen2.5-VL-7B and UGround-V1-7B on ScreenSpot-Pro and positions the whole thing as easy to drop into existing models without any retraining.

Referee Report

3 major / 1 minor

Summary. The manuscript presents DRS-GUI, a training-free framework for GUI grounding in multimodal large language models. It features a lightweight UI Perceptor that executes three perceptual actions—Focus, Shift, and Scatter—to generate region proposals from high-resolution screenshots, scheduled dynamically via a Monte Carlo Tree Search (MCTS) based Action Planner using a region quality reward. The approach is evaluated on the ScreenSpot-Pro benchmark, reporting a 14% performance improvement for both general (Qwen2.5-VL-7B) and GUI-specific (UGround-V1-7B) MLLMs.

Significance. Should the central claims hold after addressing the experimental gaps, the work could offer a practical, training-free method to enhance the grounding accuracy and generalization of existing MLLMs on cluttered GUI interfaces by mimicking human-like dynamic perceptual adjustment. This has potential implications for improving the reliability of GUI agents without the need for model fine-tuning or additional training data.

major comments (3)

Experiments section: The reported 14% gain on ScreenSpot-Pro lacks information on the number of MLLM forward passes required per test case, which is necessary to determine if the improvement stems from the dynamic region search or simply from additional model queries.
Method section: The paper does not include ablations demonstrating that the MCTS-based scheduling of Focus, Shift, and Scatter actions outperforms simpler strategies such as random region sampling or direct evaluation on the full screenshot.
§3.3 Action Planner: Details on the exact formulation of the region quality reward function are missing, making it unclear how the planner reliably ranks and selects instruction-relevant regions over the original cluttered input.

minor comments (1)

Abstract: Consider specifying the exact evaluation metric (e.g., accuracy at a given IoU threshold) underlying the 14% improvement claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We address each major comment below and indicate revisions to the manuscript.

read point-by-point responses

Referee: Experiments section: The reported 14% gain on ScreenSpot-Pro lacks information on the number of MLLM forward passes required per test case, which is necessary to determine if the improvement stems from the dynamic region search or simply from additional model queries.

Authors: We agree this detail is important for interpreting the source of gains. In the revised manuscript we have added the average number of MLLM forward passes per test case (approximately 6.2 for DRS-GUI versus 1 for direct baselines). We further include a controlled comparison showing that simply increasing query budget on the full screenshot does not reproduce the observed accuracy lift, indicating the benefit arises from targeted region selection and pruning rather than query volume alone. revision: yes
Referee: Method section: The paper does not include ablations demonstrating that the MCTS-based scheduling of Focus, Shift, and Scatter actions outperforms simpler strategies such as random region sampling or direct evaluation on the full screenshot.

Authors: We acknowledge the value of such ablations. The revised manuscript now includes an ablation study (new Section 4.3) comparing MCTS scheduling against random action selection and direct full-screenshot evaluation. Results confirm MCTS yields higher grounding accuracy; random sampling frequently selects irrelevant regions while direct evaluation is hindered by visual clutter. These additions directly address the concern. revision: yes
Referee: §3.3 Action Planner: Details on the exact formulation of the region quality reward function are missing, making it unclear how the planner reliably ranks and selects instruction-relevant regions over the original cluttered input.

Authors: We apologize for the omission. The region quality reward is formulated as r = α · sim(MLLM_embed(region), MLLM_embed(instruction)) − β · (area(region)/area(screenshot)), where sim denotes cosine similarity of embeddings and α, β are tuned hyperparameters. This formulation prioritizes semantically aligned yet compact regions. We have inserted the exact equation, hyperparameter values, and usage within the MCTS backup step into the revised §3.3, together with a short derivation of why it favors relevant regions over cluttered input. revision: yes

Circularity Check

0 steps flagged

No significant circularity: method is training-free with externally evaluated gains and no self-referential derivations or fitted predictions.

full rationale

The paper presents DRS-GUI as a training-free framework using a lightweight UI Perceptor with Focus/Shift/Scatter actions scheduled by MCTS and a region quality reward. No equations, parameter fits, or derivations are described that reduce the reported 14% ScreenSpot-Pro improvement to an internal definition or self-citation chain. The central claims rely on external benchmarks (Qwen2.5-VL-7B, UGround-V1-7B) rather than any quantity defined inside the paper itself. Self-citations, if present in the full text, are not load-bearing for the core performance claim, which remains independently falsifiable via the stated evaluation protocol. This is a standard honest non-finding for a descriptive systems paper without mathematical self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that the three named perceptual actions plus MCTS search are sufficient to locate relevant regions without training.

pith-pipeline@v0.9.0 · 5740 in / 1364 out tokens · 43558 ms · 2026-05-19T14:20:03.742235+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) ... an Action Planner based on Monte Carlo Tree Search (MCTS). A region quality reward is employed...
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The MCTS planner runs with a rollout budget of N=8, depth limit H=3...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 17 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Introducing our multimodal models, 2023

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 6

work page 2023
[6]

Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025. 2

work page arXiv 2025
[7]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Seeclick: Harness- ing GUI grounding for advanced visual GUI agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yan- tao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harness- ing GUI grounding for advanced visual GUI agents. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 9313–

work page 2024
[10]

1, 2, 5, 6

Association for Computational Linguistics, 2024. 1, 2, 5, 6

work page 2024
[11]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samual Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neu- ral Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 2

work page 2023
[12]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024. 1, 2, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Cogagent: A visual lan- guage model for GUI agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual lan- guage model for GUI agents. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 14281–14290. IEEE, 2024. 6

work page 2024
[15]

The dawn of gui agent: A preliminary case study with claude 3.5 computer use.arXiv preprint arXiv:2411.10323,

Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A preliminary case study with claude 3.5 computer use.arXiv preprint arXiv:2411.10323,

work page arXiv
[16]

Bandit based monte- carlo planning

Levente Kocsis and Csaba Szepesv ´ari. Bandit based monte- carlo planning. InEuropean conference on machine learn- ing, pages 282–293, 2006. 4

work page 2006
[17]

A training-free, task-agnostic framework for enhancing mllm performance on high-resolution images

Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, and Seonjoo Kim. A training-free, task-agnostic framework for enhancing mllm performance on high-resolution images. arXiv preprint arXiv:2507.10202, 2025. 1, 2

work page arXiv 2025
[18]

Ground- ing multimodal large language model in gui world

Weixian Lei, Difei Gao, and Mike Zheng Shou. Ground- ing multimodal large language model in gui world. InThe Thirteenth International Conference on Learning Represen- tations. 1

work page
[19]

Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding

Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9098–9108, 2025. 2

work page 2025
[20]

2025 , publisher =

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high- resolution computer use.arXiv preprint arXiv:2504.07981,

work page arXiv
[21]

Showui: One vision-language-action model for gener- alist gui agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gener- alist gui agent. InNeurIPS 2024 Workshop on Open-World Agents, 2024. 1, 6

work page 2024
[22]

Omniparser for pure vision based gui agent

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent. arXiv preprint arXiv:2408.00203, 2024. 2, 3

work page arXiv 2024
[23]

Gpt-4v(ision) system card.CoRR, 2023

OpenAI. Gpt-4v(ision) system card.CoRR, 2023. 6

work page 2023
[24]

R-vlm: Region-aware vision language model for precise gui grounding.arXiv preprint arXiv:2507.05673, 2025

Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R Manmatha, and Shabnam Ghadar. R-vlm: Region-aware vision language model for precise gui grounding.arXiv preprint arXiv:2507.05673, 2025. 1, 2

work page arXiv 2025
[25]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shi- jue Huang, et al. Ui-tars: Pioneering automated gui inter- action with native agents.arXiv preprint arXiv:2501.12326,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragki- adaki. Grounded reinforcement learning for visual reason- ing.arXiv preprint arXiv:2505.23678, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Modeling human visual search: A combined bayesian searcher and saliency map approach for eye movement guidance in natural scenes

Melanie Sclar, Gaston Bujia, Sebastian Vita, Guillermo Solovey, and Juan Esteban Kamienkowski. Modeling human visual search: A combined bayesian searcher and saliency map approach for eye movement guidance in natural scenes. InNeurIPS Workshop SVRHM, 2020. 2

work page 2020
[28]

Falcon-ui: Understand- ing gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024

Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understand- ing gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024. 1

work page arXiv 2024
[29]

Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrit- twieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016. 2

work page 2016
[30]

One embedder, any task: Instruction-finetuned text embeddings.arXiv preprint arXiv:2212.09741, 2022

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings.arXiv preprint arXiv:2212.09741, 2022. 2, 3

work page arXiv 2022
[31]

Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024. 2

work page arXiv 2024
[32]

Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.Psychological Review, 2006

Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.Psychological Review, 2006. 2, 4

work page 2006
[33]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Learning active perception via self- evolving preference optimization for gui grounding.arXiv preprint arXiv:2509.04243, 2025

Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, and Juntao Li. Learning active perception via self- evolving preference optimization for gui grounding.arXiv preprint arXiv:2509.04243, 2025. 1, 2

work page arXiv 2025
[38]

Visual search: How do we find what we are looking for?Annual review of vision science, 2020

Jeremy M Wolfe. Visual search: How do we find what we are looking for?Annual review of vision science, 2020. 2

work page 2020
[39]

Visual search in scenes involves se- lective and nonselective pathways.Trends in Cognitive Sci- ences, 2011

Jeremy M Wolfe, Melissa L-H V ˜o, Karla K Evans, and Michelle R Greene. Visual search in scenes involves se- lective and nonselective pathways.Trends in Cognitive Sci- ences, 2011. 2, 4

work page 2011
[40]

Dimo-gui: Advancing test-time scaling in gui grounding via modality- aware visual reasoning.arXiv preprint arXiv:2507.00008,

Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qing- wen Ye, Ming-Hsuan Yang, and Yiwei Wang. Dimo-gui: Advancing test-time scaling in gui grounding via modality- aware visual reasoning.arXiv preprint arXiv:2507.00008,

work page arXiv
[41]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 2

work page 2024
[42]

Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143,

Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143,

work page arXiv
[43]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 1, 2, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tian- bao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction.arXiv preprint arXiv:2412.04454, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

McAuley, Zicheng Gao, Lijuan Liu, and Lijuan Wang

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Lin- jie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023. 2

work page arXiv 2023
[46]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Aria-ui: Visual grounding for gui instructions

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 22418–22433, 2025. 1, 6

work page 2025
[48]

CAT: enhancing multimodal large language model to answer questions in dynamic audio-visual scenar- ios

Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao. CAT: enhancing multimodal large language model to answer questions in dynamic audio-visual scenar- ios. InComputer Vision - ECCV 2024 - 18th European Con- ference, Milan, Italy, September 29-October 4, 2024, Pro- ceedings, Part X, pages 146–164, 2024. 2

work page 2024
[49]

Cat+: Investigating and enhancing audio-visual understanding in large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Qilang Ye, Zitong Yu, Rui Shao, Yawen Cui, Xiangui Kang, Xin Liu, Philip Torr, and Xiaochun Cao. Cat+: Investigating and enhancing audio-visual understanding in large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

work page 2025
[50]

When eyes and ears disagree: Can mllms discern audio-visual confusion? InAAAI Conference on Artificial Intelligence, 2026

Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zi- tong Yu, and Yu Zhou. When eyes and ears disagree: Can mllms discern audio-visual confusion? InAAAI Conference on Artificial Intelligence, 2026. 2

work page 2026
[51]

Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning.arXiv preprint arXiv:2505.12370, 2025. 1

work page arXiv 2025
[52]

Finding any waldo with zero- shot invariant and efficient visual search.Nature communi- cations, 2018

Mengmi Zhang, Jiashi Feng, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, and Gabriel Kreiman. Finding any waldo with zero- shot invariant and efficient visual search.Nature communi- cations, 2018. 2

work page 2018
[53]

Learn- ing gui grounding with spatial reasoning from visual feed- back.arXiv preprint arXiv:2509.21552, 2025

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, et al. Learn- ing gui grounding with spatial reasoning from visual feed- back.arXiv preprint arXiv:2509.21552, 2025. 2

work page arXiv 2025
[54]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025. 2

work page arXiv 2025

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Introducing our multimodal models, 2023

Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa ˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 6

work page 2023

[6] [6]

Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, and Xiaodan Liang. Ground-r1: Incentivizing grounded visual reasoning via reinforcement learning.arXiv preprint arXiv:2505.20272, 2025. 2

work page arXiv 2025

[7] [7]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Seeclick: Harness- ing GUI grounding for advanced visual GUI agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yan- tao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harness- ing GUI grounding for advanced visual GUI agents. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 9313–

work page 2024

[10] [10]

1, 2, 5, 6

Association for Computational Linguistics, 2024. 1, 2, 5, 6

work page 2024

[11] [11]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samual Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neu- ral Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. 2

work page 2023

[12] [12]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024. 1, 2, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Cogagent: A visual lan- guage model for GUI agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual lan- guage model for GUI agents. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 14281–14290. IEEE, 2024. 6

work page 2024

[15] [15]

The dawn of gui agent: A preliminary case study with claude 3.5 computer use.arXiv preprint arXiv:2411.10323,

Siyuan Hu, Mingyu Ouyang, Difei Gao, and Mike Zheng Shou. The dawn of gui agent: A preliminary case study with claude 3.5 computer use.arXiv preprint arXiv:2411.10323,

work page arXiv

[16] [16]

Bandit based monte- carlo planning

Levente Kocsis and Csaba Szepesv ´ari. Bandit based monte- carlo planning. InEuropean conference on machine learn- ing, pages 282–293, 2006. 4

work page 2006

[17] [17]

A training-free, task-agnostic framework for enhancing mllm performance on high-resolution images

Jaeseong Lee, Yeeun Choi, Heechan Choi, Hanjung Kim, and Seonjoo Kim. A training-free, task-agnostic framework for enhancing mllm performance on high-resolution images. arXiv preprint arXiv:2507.10202, 2025. 1, 2

work page arXiv 2025

[18] [18]

Ground- ing multimodal large language model in gui world

Weixian Lei, Difei Gao, and Mike Zheng Shou. Ground- ing multimodal large language model in gui world. InThe Thirteenth International Conference on Learning Represen- tations. 1

work page

[19] [19]

Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding

Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9098–9108, 2025. 2

work page 2025

[20] [20]

2025 , publisher =

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high- resolution computer use.arXiv preprint arXiv:2504.07981,

work page arXiv

[21] [21]

Showui: One vision-language-action model for gener- alist gui agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gener- alist gui agent. InNeurIPS 2024 Workshop on Open-World Agents, 2024. 1, 6

work page 2024

[22] [22]

Omniparser for pure vision based gui agent

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent. arXiv preprint arXiv:2408.00203, 2024. 2, 3

work page arXiv 2024

[23] [23]

Gpt-4v(ision) system card.CoRR, 2023

OpenAI. Gpt-4v(ision) system card.CoRR, 2023. 6

work page 2023

[24] [24]

R-vlm: Region-aware vision language model for precise gui grounding.arXiv preprint arXiv:2507.05673, 2025

Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R Manmatha, and Shabnam Ghadar. R-vlm: Region-aware vision language model for precise gui grounding.arXiv preprint arXiv:2507.05673, 2025. 1, 2

work page arXiv 2025

[25] [25]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shi- jue Huang, et al. Ui-tars: Pioneering automated gui inter- action with native agents.arXiv preprint arXiv:2501.12326,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Grounded Reinforcement Learning for Visual Reasoning

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragki- adaki. Grounded reinforcement learning for visual reason- ing.arXiv preprint arXiv:2505.23678, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Modeling human visual search: A combined bayesian searcher and saliency map approach for eye movement guidance in natural scenes

Melanie Sclar, Gaston Bujia, Sebastian Vita, Guillermo Solovey, and Juan Esteban Kamienkowski. Modeling human visual search: A combined bayesian searcher and saliency map approach for eye movement guidance in natural scenes. InNeurIPS Workshop SVRHM, 2020. 2

work page 2020

[28] [28]

Falcon-ui: Understand- ing gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024

Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understand- ing gui before following user instructions.arXiv preprint arXiv:2412.09362, 2024. 1

work page arXiv 2024

[29] [29]

Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrit- twieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016. 2

work page 2016

[30] [30]

One embedder, any task: Instruction-finetuned text embeddings.arXiv preprint arXiv:2212.09741, 2022

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings.arXiv preprint arXiv:2212.09741, 2022. 2, 3

work page arXiv 2022

[31] [31]

Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024

Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers.arXiv preprint arXiv:2408.08862, 2024. 2

work page arXiv 2024

[32] [32]

Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.Psychological Review, 2006

Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search.Psychological Review, 2006. 2, 4

work page 2006

[33] [33]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [36]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [37]

Learning active perception via self- evolving preference optimization for gui grounding.arXiv preprint arXiv:2509.04243, 2025

Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, and Juntao Li. Learning active perception via self- evolving preference optimization for gui grounding.arXiv preprint arXiv:2509.04243, 2025. 1, 2

work page arXiv 2025

[37] [38]

Visual search: How do we find what we are looking for?Annual review of vision science, 2020

Jeremy M Wolfe. Visual search: How do we find what we are looking for?Annual review of vision science, 2020. 2

work page 2020

[38] [39]

Visual search in scenes involves se- lective and nonselective pathways.Trends in Cognitive Sci- ences, 2011

Jeremy M Wolfe, Melissa L-H V ˜o, Karla K Evans, and Michelle R Greene. Visual search in scenes involves se- lective and nonselective pathways.Trends in Cognitive Sci- ences, 2011. 2, 4

work page 2011

[39] [40]

Dimo-gui: Advancing test-time scaling in gui grounding via modality- aware visual reasoning.arXiv preprint arXiv:2507.00008,

Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qing- wen Ye, Ming-Hsuan Yang, and Yiwei Wang. Dimo-gui: Advancing test-time scaling in gui grounding via modality- aware visual reasoning.arXiv preprint arXiv:2507.00008,

work page arXiv

[40] [41]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 2

work page 2024

[41] [42]

Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143,

Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143,

work page arXiv

[42] [43]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024. 1, 2, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [44]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tian- bao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous gui interaction.arXiv preprint arXiv:2412.04454, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [45]

McAuley, Zicheng Gao, Lijuan Liu, and Lijuan Wang

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Lin- jie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023. 2

work page arXiv 2023

[45] [46]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [47]

Aria-ui: Visual grounding for gui instructions

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions. InFindings of the Association for Compu- tational Linguistics: ACL 2025, pages 22418–22433, 2025. 1, 6

work page 2025

[47] [48]

CAT: enhancing multimodal large language model to answer questions in dynamic audio-visual scenar- ios

Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, and Xiaochun Cao. CAT: enhancing multimodal large language model to answer questions in dynamic audio-visual scenar- ios. InComputer Vision - ECCV 2024 - 18th European Con- ference, Milan, Italy, September 29-October 4, 2024, Pro- ceedings, Part X, pages 146–164, 2024. 2

work page 2024

[48] [49]

Cat+: Investigating and enhancing audio-visual understanding in large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Qilang Ye, Zitong Yu, Rui Shao, Yawen Cui, Xiangui Kang, Xin Liu, Philip Torr, and Xiaochun Cao. Cat+: Investigating and enhancing audio-visual understanding in large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2

work page 2025

[49] [50]

When eyes and ears disagree: Can mllms discern audio-visual confusion? InAAAI Conference on Artificial Intelligence, 2026

Qilang Ye, Wei Zeng, Meng Liu, Jie Zhang, Yupeng Hu, Zi- tong Yu, and Yu Zhou. When eyes and ears disagree: Can mllms discern audio-visual confusion? InAAAI Conference on Artificial Intelligence, 2026. 2

work page 2026

[50] [51]

Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning.arXiv preprint arXiv:2505.12370, 2025. 1

work page arXiv 2025

[51] [52]

Finding any waldo with zero- shot invariant and efficient visual search.Nature communi- cations, 2018

Mengmi Zhang, Jiashi Feng, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, and Gabriel Kreiman. Finding any waldo with zero- shot invariant and efficient visual search.Nature communi- cations, 2018. 2

work page 2018

[52] [53]

Learn- ing gui grounding with spatial reasoning from visual feed- back.arXiv preprint arXiv:2509.21552, 2025

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, et al. Learn- ing gui grounding with spatial reasoning from visual feed- back.arXiv preprint arXiv:2509.21552, 2025. 2

work page arXiv 2025

[53] [54]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [55]

Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025. 2

work page arXiv 2025