arxiv: 2602.14276 · v2 · submitted 2026-02-15 · 💻 cs.CV


ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

A. Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, Peter Staar


Pith reviewed 2026-05-15 21:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords screen parsing · dense UI annotations · vision-language models · computer-use agents · UI grounding · web screenshots · structured markup


The pith

Dense annotations of every visible UI element in 771K web screenshots let a 316M-parameter VLM beat much larger models on screen parsing, and finetuning foundation VLMs on the same data improves their grounding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most existing datasets for computer-use agents label only a sparse subset of elements per screen, which restricts coverage and generalization. ScreenParse supplies complete annotations for all visible UI elements (bounding boxes, 55-class types, and text) across 771K web screenshots, produced by an automated rendering and VLM-relabeling pipeline. A compact 316M-parameter model trained on this data decodes a structured ScreenTag representation with a loss that emphasizes structural tokens, reaching 0.592 PageIoU on ScreenParse against 0.294 for much larger foundation VLMs, and transferring to public benchmarks. Finetuning foundation VLMs on the new dataset consistently raises their grounding performance, which the authors read as evidence that full structural supervision supplies useful priors. This setup targets low-latency, on-device perception for agents that must act reliably on what they see.

Core claim

ScreenParse supplies complete, dense supervision of all visible UI elements including boxes, 55-class types and text in 771K screenshots. Training ScreenVLM on it with a compact ScreenTag representation and structure-aware loss yields 0.592 PageIoU, substantially above the 0.294 of larger foundation VLMs, with strong transfer and consistent gains when used for finetuning.

What carries the argument

The ScreenParse dataset of dense UI annotations generated by the Webshot pipeline, together with the ScreenVLM model that decodes a compact ScreenTag markup representation under a structure-aware loss.
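The idea of a loss that upweights structure-critical tokens can be illustrated with a weighted negative log-likelihood. This is a minimal sketch under stated assumptions, not the paper's loss: the token set `STRUCTURAL_TOKENS`, the weight value `2.0`, and the normalization by total weight are all hypothetical choices, since the text here does not specify the weighting scheme.

```python
import math

# Hypothetical set of structure-critical tokens (assumption for illustration;
# the actual ScreenTag vocabulary is not given in this text).
STRUCTURAL_TOKENS = {"<button>", "</button>", "<loc>"}


def weighted_nll(token_log_probs, target_tokens, structural_weight=2.0):
    """Weighted mean negative log-likelihood over one target sequence.

    token_log_probs: log-probability the model assigned to each target token.
    target_tokens:   the target token strings, same length.
    Structural tokens count `structural_weight` times; others count once.
    """
    if len(token_log_probs) != len(target_tokens):
        raise ValueError("length mismatch between log-probs and targets")
    total, norm = 0.0, 0.0
    for logp, tok in zip(token_log_probs, target_tokens):
        w = structural_weight if tok in STRUCTURAL_TOKENS else 1.0
        total += -w * logp
        norm += w
    return total / norm if norm else 0.0


tokens = ["<button>", "OK", "</button>"]
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
loss = weighted_nll(log_probs, tokens)
```

Dividing by the summed weights keeps the loss on the same scale as an unweighted mean, so changing `structural_weight` shifts emphasis between token types without rescaling the objective. Whether the paper normalizes this way is not stated.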

If this is right

ScreenVLM reaches higher dense parsing accuracy than much larger foundation VLMs on the ScreenParse benchmark.
Finetuning foundation VLMs on ScreenParse data consistently improves their performance on grounding tasks.
The trained model transfers effectively to existing public UI benchmarks.
Compact structured decoding supports low-latency on-device deployment for computer-use agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pipeline generalizes, similar dense supervision could be extended to mobile or desktop interfaces beyond web pages.
Explicit structure-aware losses may reduce the parameter count needed for reliable agent perception.
Complete element coverage could support more robust multi-step instruction following in complex screens.

Load-bearing premise

The Webshot pipeline's VLM-based relabeling and quality filtering produces accurate, unbiased, and complete annotations for all visible elements across diverse web screenshots without systematic errors or coverage gaps.
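One place where coverage gaps can enter is the step that decides which DOM-derived boxes count as visible. The sketch below is a hypothetical illustration of such a step, not the Webshot implementation: the `Element` record, the `min_area` threshold, and the clipping rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical record for one DOM-derived annotation."""
    tag: str
    box: tuple  # (x0, y0, x1, y1) in pixels, exclusive upper bounds
    text: str = ""


def clip_to_viewport(box, width, height):
    """Clamp a box to the screenshot bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0), max(0, y0), min(width, x1), min(height, y1))


def visible_elements(elements, width, height, min_area=4):
    """Keep elements whose clipped box has at least `min_area` pixels."""
    kept = []
    for el in elements:
        x0, y0, x1, y1 = clip_to_viewport(el.box, width, height)
        if x1 <= x0 or y1 <= y0:
            continue  # entirely off-screen or zero-sized
        if (x1 - x0) * (y1 - y0) < min_area:
            continue  # too small to be a meaningful target
        kept.append(Element(el.tag, (x0, y0, x1, y1), el.text))
    return kept


candidates = [
    Element("button", (10, 10, 50, 30), "OK"),
    Element("div", (-20, -20, -5, -5)),       # off-screen
    Element("img", (90, 90, 120, 120)),       # partly outside the viewport
    Element("span", (5, 5, 6, 6)),            # one pixel
]
kept = visible_elements(candidates, width=100, height=100)
```

A geometric rule like this ignores occlusion, CSS visibility, and transparency, which is exactly the kind of systematic error the premise above has to rule out.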

What would settle it

A manual audit of randomly sampled screens from ScreenParse that reveals many visible UI elements missing annotations, assigned wrong classes, or given incorrect text would falsify the claim of reliable complete supervision.
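Such an audit could be scored per screen roughly as follows. This is a sketch under stated assumptions: the record layout (`box`, `type`, `text`), the 0.5 IoU threshold, and the greedy matching are illustrative choices, not a protocol from the paper.

```python
def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def audit_screen(human, pipeline, iou_thresh=0.5):
    """Compare human labels with pipeline labels for one screenshot.

    Each human element is greedily matched to the unused pipeline element
    with the highest IoU at or above `iou_thresh`.
    """
    unused = list(range(len(pipeline)))
    counts = {"human_elements": len(human), "missing": 0,
              "wrong_type": 0, "wrong_text": 0}
    for h in human:
        best, best_iou = None, iou_thresh
        for i in unused:
            iou = box_iou(h["box"], pipeline[i]["box"])
            if iou >= best_iou:
                best, best_iou = i, iou
        if best is None:
            counts["missing"] += 1  # visible element with no annotation
            continue
        unused.remove(best)
        p = pipeline[best]
        counts["wrong_type"] += int(p["type"] != h["type"])
        counts["wrong_text"] += int(p["text"].strip() != h["text"].strip())
    counts["spurious"] = len(unused)  # pipeline boxes no human confirmed
    return counts


human = [
    {"box": (0, 0, 10, 10), "type": "Button", "text": "OK"},
    {"box": (20, 0, 30, 10), "type": "Text", "text": "Hi"},
    {"box": (50, 50, 60, 60), "type": "Icon", "text": ""},
]
pipeline = [
    {"box": (0, 0, 10, 10), "type": "Button", "text": "OK"},
    {"box": (21, 0, 31, 10), "type": "Heading", "text": "Hi"},
    {"box": (80, 80, 90, 90), "type": "Image", "text": ""},
]
report = audit_screen(human, pipeline)
```

Greedy matching is simple but not optimal on crowded screens; an assignment solver would be a reasonable substitute if the audit sample contains many overlapping elements.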

Figures

Figures reproduced from arXiv: 2602.14276 by Ahmed Nassar, A. Said Gurbuz, Marc Pollefeys, Peter Staar, Sunghwan Hong.

**Figure 2.** Qualitative example from ScreenParse illustrating dense, complete UI annotations visualized as labeled bounding boxes.

… from the public 45 Million Websites dataset. This dataset aggregates URLs from multiple sources, including Common Crawl, Alexa Top Sites, and public domain lists. We then curate a balanced subset of URLs spanning various categories (e.g., e-commerce, news, social media, blogs) to ensur…

**Figure 3.** Overview of the Webshot dataset generation pipeline. Our scalable framework renders diverse URLs with Playwright and extracts DOM-driven dense annotations. VLMs further refine UI element types and filter low-quality samples.

… preserve the DOM hierarchy: in addition to leaf nodes, we annotate enclosing container elements that carry semantic structure, e.g., navigation bars, cards, and modals. See Appendix 7.…

**Figure 4.** Overview of the ScreenVLM architecture. A screenshot is encoded by the SigLIP-2 vision encoder (Tschannen et al., 2025) into patch tokens, which are projected and fed to the Granite-165M LLM (Mishra et al., 2024) decoder together with text tokens to generate the ScreenTag sequence.

4.2. ScreenTag: Compact Screen Structure Representation. To train an autoregressive model for dense parsing, we serialize the …

**Figure 5.** Training/validation loss and accuracy curves for the YOLO component.

7.3. Evaluation Metrics. We use the indicator function $\mathbb{1}[\cdot]$, defined as $\mathbb{1}[s] = 1$ if statement $s$ is true and $0$ otherwise. Let $G$ be the set of ground-truth boxes and $P$ the set of predicted boxes for an image with pixel domain $\Omega$.

PageIoU. We define occupancy masks over pixels:

$$M_G(p) = \mathbb{1}\left[\exists\, g \in G \ \text{s.t.}\ p \in g\right], \tag{2}$$

$$M_P(p) = \mathbb{1}\left[\exists\, b \in P \ \text{s.t.}\ p \in b\right] \ldots$$

**Figure 6.** Qualitative screen parsing predictions for VLMs. Each row shows the same screenshot across columns; bounding boxes and labels are rendered as overlays. As can be seen, in terms of recall, localization, and granularity of the predictions, our ScreenVLM model significantly outperforms the Qwen3-VL-8B-Instruct model. Some of the ground truth annotations contain errors due to the rendering or DOM extraction …

**Figure 7.** Qualitative screen parsing predictions for detector/parser baselines on the GroundCUA dataset. Each row shows the same screenshot across columns. Our YOLO model has far fewer false negatives than OmniParser v2, and it covers text areas that may be important for understanding the UI.

**Figure 8.** Out-of-distribution qualitative results of our YOLO model on the ScreenSpot Mobile split. Each visualization shows ground truth (left) and the model prediction (right). The ground truth visualization is not complete since ScreenSpot provides sparse annotations.

**Figure 9.** Out-of-distribution qualitative results of our YOLO model on the GroundCUA dataset. Each visualization shows ground truth (left) and the model prediction (right).

**Figure 10.** Additional qualitative results on ScreenSpot (PC). We compare OmniParser v2 against OmniParser v2 fine-tuned on ScreenParse.

**Figure 11.** Additional qualitative result on ScreenSpot (Mobile): OmniParser v2 vs. OmniParser v2 fine-tuned on ScreenParse.

**Figure 12.** Additional qualitative results on ScreenParse: InternVL3-2B before and after fine-tuning on ScreenParse.

**Figure 13.** Additional qualitative result on ScreenSpot (Web): prompted Qwen3-VL-8B vs. Qwen3-VL-2B fine-tuned on ScreenParse.

**Figure 14.** Additional qualitative result on ScreenParse: OmniParser v2 before and after fine-tuning on ScreenParse.

**Figure 15.** Out-of-distribution qualitative result of our YOLO detector on a complex desktop multi-window screen.
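The paper excerpt printed with Figure 5 defines PageIoU through pixel occupancy masks for the ground-truth set G and the predicted set P, but it is cut off before the final ratio. A minimal sketch, assuming the score is the intersection over union of the two masks (an assumption, since the closing formula is not shown), with integer pixel boxes and a hypothetical convention of returning 1.0 when both masks are empty:

```python
def occupancy_mask(boxes, width, height):
    """M(p) is True if pixel p lies inside at least one box.

    Boxes are (x0, y0, x1, y1) with exclusive upper bounds.
    """
    mask = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            row = mask[y]
            for x in range(max(0, x0), min(width, x1)):
                row[x] = True
    return mask


def page_iou(gt_boxes, pred_boxes, width, height):
    """Assumed PageIoU: |M_G AND M_P| / |M_G OR M_P| over all pixels."""
    m_g = occupancy_mask(gt_boxes, width, height)
    m_p = occupancy_mask(pred_boxes, width, height)
    inter = union = 0
    for row_g, row_p in zip(m_g, m_p):
        for g, p in zip(row_g, row_p):
            inter += int(g and p)
            union += int(g or p)
    # Convention chosen for this sketch: two empty pages agree perfectly.
    return inter / union if union else 1.0


score = page_iou([(0, 0, 4, 4)], [(2, 0, 6, 4)], width=8, height=4)
```

Because the masks record only whether a pixel is covered, overlapping or duplicated boxes do not change the score, and the metric by itself says nothing about element types or text.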

Original abstract

Modern computer-use agents (CUA) must perceive a screen as a structured state (what elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide sparse supervision, with insufficient and low-diversity labels that annotate only a small subset of task-relevant elements per screen, which limits both coverage and generalization; moreover, practical deployment requires efficiency to enable low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (boxes, 55-class types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a compact, 316M-parameter vision language model (VLM) that decodes a compact ScreenTag markup representation with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and shows strong transfer to public benchmarks. Moreover, finetuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ScreenParse ships a genuinely new dense UI dataset at scale and a compact model that beats bigger VLMs on it, but the VLM-relabeling pipeline reports no human validation, so the gains could be partly circular.


The main thing to know is that this paper gives us the first large-scale dataset with complete, dense labels for every visible UI element across 771K web screenshots, plus a 316M model called ScreenVLM that outputs a compact ScreenTag markup and reports clear wins over much larger foundation VLMs on their test set. They also show that finetuning other VLMs on this data improves grounding on public benchmarks. That shift from sparse task-specific labels to full parsing supervision is the real contribution and could matter for building reliable computer-use agents. The structure-aware loss and the decision to keep the model small are practical choices that make sense for on-device use.

What they did well is demonstrate that dense supervision transfers and that you can generate this kind of data at scale with an automated pipeline. The soft spot is exactly what the stress test flags: the Webshot pipeline relies on VLM-based relabeling and filtering with no reported human validation, inter-annotator numbers, or error analysis. Because train and test data come from the same process, any consistent labeling artifacts get rewarded at evaluation time, which makes the 0.592 vs 0.294 PageIoU gap harder to trust without more checks. The abstract also skips details on how PageIoU is defined and how baselines were implemented.

This is for people working on UI perception and agent grounding who need better training signals. A reader who wants new data or ideas for structured output would get something useful from it. It deserves a serious referee because the scale and the empirical transfer results are substantive enough to check, even if the labeling quality needs direct scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ScreenParse, a large-scale dataset of 771K web screenshots with dense annotations of all visible UI elements (bounding boxes, 55-class types, and text) generated via the automated Webshot pipeline of URL rendering followed by VLM-based relabeling and quality filtering. It trains ScreenVLM, a compact 316M-parameter VLM that decodes a ScreenTag markup representation under a structure-aware loss, and reports that this model substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse), transfers to public benchmarks, and that finetuning foundation VLMs on ScreenParse improves their grounding performance.

Significance. If the automated labels are shown to be accurate and unbiased, the work supplies a valuable source of complete, dense supervision for UI parsing that sparse grounding datasets lack, with potential to improve computer-use agents through transferable structural priors. The empirical scale (21M elements) and the reported transfer/finetuning results would constitute a concrete advance in data-driven screen understanding.

major comments (2)

[Dataset construction (Webshot pipeline) and Abstract] The headline PageIoU gains (0.592 vs. 0.294 on ScreenParse) and all transfer/finetuning claims rest on the quality of the ScreenParse labels. The Webshot pipeline description indicates that annotations are produced by VLM relabeling and filtering with no reported human-annotated validation subset, inter-annotator agreement, or systematic error analysis; because train and test splits derive from the identical pipeline, any consistent labeling artifacts (e.g., missed occluded elements, type misclassifications, or OCR drift) would be learned and rewarded at evaluation time.
[Evaluation section and Abstract] No definition or implementation details are supplied for the primary metric PageIoU, nor for the baseline foundation VLM setups (model sizes, prompting, decoding strategies). Without these, the numerical comparisons cannot be reproduced or stress-tested for robustness.

minor comments (2)

[Model and training description] The abstract states that the structure-aware loss 'upweights structure-critical tokens' but does not specify the token weighting scheme, how the weights were chosen, or ablation results showing their contribution.
[Transfer experiments] The public benchmarks used for transfer are not enumerated, nor are the exact metrics and baseline numbers reported for those benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need to validate automated labels and ensure evaluation reproducibility. We address both major comments point-by-point below and will incorporate the requested clarifications and analyses into the revised manuscript.


Referee: The headline PageIoU gains (0.592 vs. 0.294 on ScreenParse) and all transfer/finetuning claims rest on the quality of the ScreenParse labels. The Webshot pipeline description indicates that annotations are produced by VLM relabeling and filtering with no reported human-annotated validation subset, inter-annotator agreement, or systematic error analysis; because train and test splits derive from the identical pipeline, any consistent labeling artifacts (e.g., missed occluded elements, type misclassifications, or OCR drift) would be learned and rewarded at evaluation time.

Authors: We agree that explicit validation of the automated labels is essential to support the reported gains. The Webshot pipeline uses VLM-based relabeling and filtering on a large scale, but the submitted manuscript does not include a human-annotated validation subset, inter-annotator agreement, or detailed error analysis. In the revision we will add a human study on a random sample of 500 test screens, reporting element-wise agreement for bounding boxes (IoU), type classification accuracy, and text OCR fidelity, together with inter-annotator agreement and a categorized error analysis (occlusions, misclassifications, OCR drift). We will also discuss how the observed transfer to human-annotated public benchmarks provides evidence against overfitting to pipeline-specific artifacts.
revision: yes
Referee: No definition or implementation details are supplied for the primary metric PageIoU, nor for the baseline foundation VLM setups (model sizes, prompting, decoding strategies). Without these, the numerical comparisons cannot be reproduced or stress-tested for robustness.

Authors: We apologize for the omission. PageIoU is defined as the average per-element IoU computed over the complete set of parsed UI elements on each screenshot (i.e., page-level aggregation of bounding-box overlap, type, and text matches). We will insert the exact mathematical definition, pseudocode, and all baseline implementation details into the evaluation section and supplementary material of the revised manuscript. The baseline details will cover model sizes, full prompting templates, and decoding hyperparameters (temperature, max tokens, beam size).
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset generation and external-benchmark evaluation


The paper is entirely empirical and data-driven. It describes an automated Webshot pipeline to produce ScreenParse (rendering + VLM relabeling + filtering), trains ScreenVLM on that data, and reports measured PageIoU and transfer numbers against independent foundation VLMs on both ScreenParse and public benchmarks. No equations, fitted parameters, or self-citations appear in the provided text that reduce any claimed improvement to a quantity defined by the authors' own choices. The derivation chain therefore contains no self-definitional, fitted-input, or self-citation reductions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Central claims depend on the accuracy of automated VLM relabeling for dense ground truth and on the assumption that structure-aware loss provides transferable priors. No free parameters are explicitly fitted in the abstract, but the loss weighting scheme and class taxonomy are introduced without external validation.

free parameters (1)

structure-aware loss token weights
Weights that upweight structure-critical tokens; chosen to emphasize parsing structure but not shown to be derived from first principles.

axioms (1)

domain assumption: VLM-based relabeling and filtering yield high-quality dense annotations without systematic bias
Invoked in the description of the Webshot pipeline as the mechanism that produces the 21M element annotations.

invented entities (2)

ScreenTag markup representation · no independent evidence
purpose: Compact structured output format decoded by the model
New output representation introduced for the ScreenVLM decoder.
ScreenVLM · no independent evidence
purpose: Compact VLM specialized for dense screen parsing
New 316M-parameter model trained on ScreenParse.

pith-pipeline@v0.9.0 · 5587 in / 1486 out tokens · 54289 ms · 2026-05-15T21:31:45.165464+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1]

Qwen3-VL Technical Report

doi: 10.24963/ijcai.2021/235. URL https://doi.org/10.24963/ijcai.2021/235. Main Track. Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S...

doi:10.24963/ijcai.2021/235 · 2021
[2]

ISBN 979-8-89176-256-5

Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl

doi:10.18653/v1/2025.findings-acl · 2025
[3]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

URL https://aclanthology.org/2025.findings-acl.110/. Cheng, K., Sun, Q., Chu, Y., Xu, F., YanTao, L., Zhang, J., and Wu, Z. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

doi:10.18653/v1/2024.acl-long.505 · 2025
[4]

YOLOv11: An Overview of the Key Architectural Enhancements

URL https://aclanthology.org/2024.acl-long.371/. Khanam, R. and Hussain, M. Yolov11: An overview of the key architectural enhancements, 2024. URL https://arxiv.org/abs/2410.17725. Kim, C., Shin, H., Hong, E., Yoon, H., Arnab, A., Seo, P. H., Hong, S., and Kim, S. Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers...

doi:10.18653/v1/2024.acl-long.50 · 2024
[5]

org/CorpusID:271947166

URL https://api.semanticscholar.org/CorpusID:271947166. Li, K., ziyang, M., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., and Chua, T.-S. Screenspot-pro: GUI grounding for professional high-resolution computer use. In Workshop on Reasoning and Planning for Large Language Models,

[6]

Li, Y ., Li, G., He, L., Zheng, J., Li, H., and Guan, Z

URL https://openreview.net/forum?id=XaKNDIAHas. Li, Y., Li, G., He, L., Zheng, J., Li, H., and Guan, Z. Widget captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5495–5510, 2020. Lu, Y., Yang, J., Shen, Y., and A...

doi:10.48550/arxiv.2405.04324 · 2020
[7]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

URL https://openreview.net/forum?id=oKn9c6ytLx. Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. Table 8. ScreenTag screen parsing classes (55 total) used in...

2025
[8]

visible UI elements

Appendix 7.1. Screen Parsing Label Set (ScreenTag). Tab. 8 lists the 55 semantic classes used for screen parsing in our ScreenTag annotation schema. 7.2. Training Details. Qwen3-VL-2B-Instruct Finetuning. We fine-tune Qwen3-VL-2B-Instruct on ScreenParse with BF16 and DeepSpeed ZeRO-3 offload, updating only the multimodal LLM (vision tower and projector froze...

work page arXiv 2000
[9]

- Large and important items that should be covered: - main hero images, large central text, clearly clickable buttons or tabs, prominent fields

COVERAGE / MISSING ELEMENTS (0-100)
- Look for visually obvious, distinct UI elements that **should** be annotated:
  - buttons, main text blocks, headings, input fields, icons, major images, cards, menu items, etc.
- Large and important items that should be covered:
  - main hero images, large central text, clearly clickable buttons or tabs, prominent fields...

[10]

container + child

FALSE POSITIVES / SPURIOUS BOXES (0-100)
- Penalize boxes that are not aligned with any visible UI element, such as:
  - Boxes in completely blank areas.
  - Boxes that repeat the same position but shifted somewhere else on the screen where nothing exists.
  - Boxes over pure background images or whitespace where there is no clear object or control.
- Do NOT tr...

[11]

Text" box entirely inside another

DUPLICATION / REDUNDANCY (0-100)
Focus especially on SAME-CLASS duplications:
- For NON-NESTABLE classes (for example: Text, Heading, Button, Checkbox, Radiobox, Switch, Slider, Text Input, Search Field, Image, Logo, Icon, etc.):
  - Two boxes of the same class that heavily overlap OR where one box is completely inside another usually indicate a problem.
  - ...

[12]

Text" labels stacked over the same text string, or multiple overlapping

LOCALIZATION / ALIGNMENT (0-100)
- Evaluate how well each bounding box fits its intended UI element.
- Good annotation:
  - The box tightly covers the element, with small margins.
  - It does not cut off major parts of the element.
- Penalize:
  - Boxes that are much larger than the element and include large amounts of unrelated background.
  - Box...
