arxiv: 2604.19081 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning

Xinyao Zhang , Rui Wang , Jinhao Cui , Haotian Huang , Wei Xue , Wenhua Hu , Jianwen Xiang , Rui Hao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3

classification 💻 cs.SE

keywords GUI defect detectionmulti-window scenariosmultimodal reasoningproactive testingAndroid applicationswidget occlusionchain-of-thought promptingSet-of-Mark

0 comments

The pith

A proactive framework using multimodal large language models detects GUI display defects in multi-window mobile scenarios more effectively than passive methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that GUI defects become much more likely when apps run in split-screen or on foldable devices, and that current detection tools miss many of them because they only look at finished screenshots. The authors build an end-to-end system that forces apps into these multi-window states during exploration, marks the widgets on the screenshots, and then asks multimodal models to reason step by step about possible display problems. This matters because phones and tablets increasingly support multitasking, so testing must keep up with how people actually use them. The method produces both detections at the app level and explanations at the widget level, outperforming earlier approaches on a new collection of real Android applications.

Core claim

The authors claim that their framework, which proactively triggers multi-window states such as split-screen and foldable modes, aligns screenshots to widgets using Set-of-Mark, and applies chain-of-thought prompting to multimodal large language models, can detect, localize, and explain GUI display defects. They support this with a benchmark built from 50 real-world Android apps, showing that multi-window conditions increase defect exposure and that the approach yields better results than OwlEye and YOLO-based baselines.

What carries the argument

The combination of proactive state triggering during app exploration, Set-of-Mark visual marking for widget alignment, and chain-of-thought reasoning in multimodal large language models, which together enable interpretation of complex multi-window interfaces.

If this is right

Multi-window settings make layout defects such as text truncation far more common than in full-screen operation.
The method can identify a large number of defect-prone applications with low rates of false positives and false negatives.
It achieves higher accuracy than existing tools for both app-level and fine-grained widget-level defect detection.
The new benchmark of 50 Android apps enables systematic study of defects in dynamic multi-window environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Testing teams could run this kind of proactive exploration as part of continuous integration to catch issues early in development.
The reliance on current multimodal models suggests that improvements in those models would directly improve defect detection reliability.
Similar techniques might help detect related problems like inconsistent behavior across different device orientations or screen sizes.
App designers may need to prioritize layout flexibility even more when targeting modern multitasking features.

Load-bearing premise

The multimodal large language models will produce accurate detections and explanations from the marked screenshots without frequent hallucinations or consistent blind spots.

What would settle it

A controlled test where the framework is applied to a collection of apps known to have or lack specific multi-window defects, then measuring whether its detections match the known ground truth at both app and widget levels.

Figures

Figures reproduced from arXiv: 2604.19081 by Haotian Huang, Jianwen Xiang, Jinhao Cui, Rui Hao, Rui Wang, Wei Xue, Wenhua Hu, Xinyao Zhang.

**Figure 2.** Figure 2: Framework overview of our approach. Enhanced DroidBot collects screenshots and view hierarchies across multiple window states, and the resulting [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Application-level FPR and FNR under conventional and split/fold [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Representative examples of the major GUI adaptation defect categories [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Examples of normal and mismatched GUI layouts under foldable and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Workflow of the enhanced DroidBot framework for triggering split [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: Example of SoM construction. The left image shows the original [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: A worked example of the multimodal chain-of-thought prompting [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

read the original abstract

Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings.We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We also construct a benchmark of GUI display defects using 50 real-world Android applications.Experimental results show that multi-window settings substantially increase the exposure of layout-related defects, with text truncation increasing by 184% compared with conventional full-screen settings. At the application level, our method detects 40 defect-prone apps with a false positive rate of 10.00% and a false negative rate of 11.11%, outperforming OwlEye and YOLO-based baselines. At the fine-grained level, it achieves the best F1 score of 87.2% for widget occlusion detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a proactive MLLM pipeline for catching GUI defects in multi-window mobile apps, with a useful benchmark idea but thin details on validating the model's outputs.

read the letter

The paper's key contribution is an end-to-end system that proactively triggers multi-window states like split-screen and foldables during app testing, aligns interface elements using Set-of-Mark, and applies multimodal LLMs with chain-of-thought to detect, locate, and explain GUI defects such as occlusion and truncation. This setup is new for the multi-window context, building on but going beyond passive single-screen methods. The work does well in demonstrating the increased defect risk in these scenarios, with the 184% rise in text truncation providing a clear data point. Constructing a benchmark from 50 real Android applications is a positive step, and the claimed improvements over OwlEye and YOLO baselines at both app and widget levels show the framework can be effective in practice. The proactive exploration angle makes sense for catching issues that only appear under dynamic layouts. The main soft spot is the validation of the MLLM component. The results depend on the model correctly identifying defects without hallucinations, yet the available description does not detail any human review, agreement metrics, or filtering for the model's outputs. This makes the F1 score of 87.2% and the low false rates harder to fully accept at face value. The experimental protocol details are also missing from the high-level summary, which limits how much we can assess the soundness. This paper is for researchers and practitioners in mobile software testing and automated GUI analysis. Readers interested in applying large models to software engineering tasks would get the most out of it. It is worth a serious referee because it tackles a timely issue with a concrete pipeline and some empirical results, even if more rigor on evaluation is needed. I would recommend sending it for peer review with specific requests for expanded validation sections and baseline implementation details.

Referee Report

3 major / 2 minor

Summary. The paper presents an end-to-end proactive framework for detecting GUI display defects (e.g., occlusion, truncation) in multi-window mobile scenarios. It triggers split-screen/foldable/window-transition states during exploration, aligns screenshots via Set-of-Mark (SoM) with widget elements, and uses multimodal LLMs with chain-of-thought prompting to detect, localize, and explain defects. A benchmark is built from 50 real Android apps; results claim multi-window settings increase defects (text truncation +184%), the method flags 40 defect-prone apps (FPR 10%, FNR 11.11%), and achieves 87.2% F1 on fine-grained widget occlusion, outperforming OwlEye and YOLO baselines.

Significance. If validated, the work could meaningfully advance automated GUI testing by shifting from passive post-facto analysis to proactive multi-window exploration and leveraging MLLM reasoning for both detection and explanation. The combination of state triggering, SoM alignment, and CoT prompting addresses a timely gap as foldables and split-screen become common; the reported 184% increase in truncation exposure and strong fine-grained F1 provide concrete evidence of the problem's severity and the method's potential utility.

major comments (3)

[Evaluation / Experimental Results] Evaluation section (and abstract): The central performance claims (40/50 apps flagged, FPR=10%, FNR=11.11%, F1=87.2% on occlusion) are presented without any description of the ground-truth annotation protocol, number of human annotators, inter-annotator agreement, or explicit hallucination-mitigation steps for the MLLM outputs. Because the benchmark itself is constructed via the same MLLM+CoT pipeline, this omission creates a circularity risk that uncaught systematic errors could inflate all reported metrics.
[Experimental Results] §4 (or equivalent baseline comparison subsection): The outperformance over OwlEye and YOLO-based baselines is stated quantitatively but without implementation details, version numbers, hyperparameter settings, or whether the baselines were adapted for multi-window inputs. This prevents assessment of whether the gains are due to the proposed framework or to differences in experimental setup.
[Benchmark Construction] Benchmark construction paragraph: The paper states that 50 apps were used to build the defect benchmark, yet provides no information on how apps were selected, how many screenshots per app were collected, or whether any independent human review was performed to confirm the MLLM-generated labels before computing FPR/FNR/F1. This detail is load-bearing for the reliability of all quantitative results.

minor comments (2)

[Abstract / Results] The abstract and results text use “multi-window settings substantially increase the exposure of layout-related defects” without citing the exact table or figure that quantifies the 184% truncation increase or providing confidence intervals.
[Method / Prompt Design] Notation for defect categories (occlusion, truncation, etc.) is introduced without a clear taxonomy or example images in the main text; readers must infer definitions from the MLLM prompt examples.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We appreciate the identification of areas where additional transparency is needed in our evaluation, baselines, and benchmark. We will revise the manuscript to incorporate the requested details while preserving the core contributions.

read point-by-point responses

Referee: [Evaluation / Experimental Results] Evaluation section (and abstract): The central performance claims (40/50 apps flagged, FPR=10%, FNR=11.11%, F1=87.2% on occlusion) are presented without any description of the ground-truth annotation protocol, number of human annotators, inter-annotator agreement, or explicit hallucination-mitigation steps for the MLLM outputs. Because the benchmark itself is constructed via the same MLLM+CoT pipeline, this omission creates a circularity risk that uncaught systematic errors could inflate all reported metrics.

Authors: We agree that the manuscript lacks explicit details on the ground-truth protocol, creating a valid concern about circularity. In the revision we will add a dedicated paragraph in the evaluation section describing the full annotation process. This will specify the human annotation protocol, number of annotators, inter-annotator agreement, and hallucination-mitigation steps (including cross-verification against UI hierarchies and consensus requirements). We will also clarify that benchmark labels received independent human validation separate from the automated detection pipeline, thereby addressing the circularity risk. revision: yes
Referee: [Experimental Results] §4 (or equivalent baseline comparison subsection): The outperformance over OwlEye and YOLO-based baselines is stated quantitatively but without implementation details, version numbers, hyperparameter settings, or whether the baselines were adapted for multi-window inputs. This prevents assessment of whether the gains are due to the proposed framework or to differences in experimental setup.

Authors: The referee is correct that implementation details for the baselines are absent. We will expand the baseline comparison subsection to include the exact versions of OwlEye and YOLO employed, all hyperparameter settings, and the adaptations made for multi-window inputs (such as per-window processing or screenshot concatenation). This addition will enable readers to verify that performance differences arise from our framework rather than experimental discrepancies. revision: yes
Referee: [Benchmark Construction] Benchmark construction paragraph: The paper states that 50 apps were used to build the defect benchmark, yet provides no information on how apps were selected, how many screenshots per app were collected, or whether any independent human review was performed to confirm the MLLM-generated labels before computing FPR/FNR/F1. This detail is load-bearing for the reliability of all quantitative results.

Authors: We acknowledge the omission of these load-bearing details. The revised benchmark construction paragraph will specify the app selection criteria, the number of screenshots collected per app, and the independent human review process used to validate MLLM-generated labels prior to metric computation. These additions will directly support the reliability of the reported FPR, FNR, and F1 scores. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external apps

full rationale

The paper proposes a framework for GUI defect detection and reports performance metrics (40/50 apps flagged, FPR 10%, FNR 11.11%, F1 87.2% on occlusion) as direct experimental outcomes from running the method on 50 real-world Android applications. No equations, self-definitional constructs, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The benchmark construction and evaluation steps are presented as independent of the core claims by construction, satisfying the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the unverified assumption that current multimodal LLMs can perform reliable visual reasoning about GUI defects; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption Multimodal LLMs with chain-of-thought prompting can accurately detect, localize, and explain display defects from screenshots in multi-window scenarios.
Central to the detection and explanation step; no independent validation supplied.

pith-pipeline@v0.9.0 · 5562 in / 1328 out tokens · 71816 ms · 2026-05-10T02:58:46.357743+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Learn about foldables,

Android Developers, “Learn about foldables,” [Online]. Available: https: //developer.android.com/develop/ui/compose/layouts/adaptive/foldables/ learn-about-foldables, 2026, accessed: Mar. 11, 2026

2026
[2]

Support multi-window mode,

——, “Support multi-window mode,” [Online]. Available: https://develo per.android.com/develop/ui/views/layout/support-multi-window-mode, 2026, accessed: Mar. 11, 2026

2026
[3]

Screen compatibility overview,

——, “Screen compatibility overview,” [Online]. Available: https://de veloper.android.com/guide/practices/screens support, 2026, accessed: Mar. 11, 2026

2026
[4]

Sok: An exhaustive taxonomy of display issues for mobile applications,

L. Nie, K. S. Said, and M. Hu, “Sok: An exhaustive taxonomy of display issues for mobile applications,” inProceedings of the 29th International Conference on Intelligent User Interfaces, 2024, pp. 537–548

2024
[5]

The metamorphosis: Automatic detection of scaling issues for mobile apps,

Y . Su, C. Chen, J. Wang, Z. Liu, D. Wang, S. Li, and Q. Wang, “The metamorphosis: Automatic detection of scaling issues for mobile apps,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–12

2022
[6]

Owl eyes: Spotting UI display issues via visual understanding,

Z. Liu, C. Chen, J. Wang, Y . Huang, J. Hu, and Q. Wang, “Owl eyes: Spotting UI display issues via visual understanding,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 398–409

2020
[7]

DroidBot: A lightweight UI- guided test input generator for Android,

Y . Li, Z. Yang, Y . Guo, and X. Chen, “DroidBot: A lightweight UI- guided test input generator for Android,” in2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE- C). IEEE, 2017, pp. 23–26

2017
[8]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4v,”arXiv preprint arXiv:2310.11441, 2023

work page internal anchor Pith review arXiv 2023
[9]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

2022
[10]

Adaptation of traditional usability testing methods for remote testing,

J. Scholtz, “Adaptation of traditional usability testing methods for remote testing,” inProceedings of the 34th Annual Hawaii International Conference on System Sciences. IEEE, 2001, pp. 8 pp.–

2001
[11]

A remote usability testing platform for mobile phones,

H. Liang, H. Song, Y . Fu, X. Cai, and Z. Zhang, “A remote usability testing platform for mobile phones,” in2011 IEEE International Confer- ence on Computer Science and Automation Engineering, vol. 2. IEEE, 2011, pp. 312–316

2011
[12]

Heuristics for the assessment of interfaces of mobile devices,

O. Machado Neto and M. D. G. Pimentel, “Heuristics for the assessment of interfaces of mobile devices,” inProceedings of the 19th Brazilian Symposium on Multimedia and the Web, 2013, pp. 93–96

2013
[13]

Heuristics for evaluating multi-touch gestures in mobile applications,

S. R. Humayoun, P. H. Chotala, M. S. Bashir, and A. Ebert, “Heuristics for evaluating multi-touch gestures in mobile applications,” inPro- ceedings of the 31st International BCS Human Computer Interaction Conference (HCI 2017). BCS Learning & Development, 2017

2017
[14]

EUHSA: Extending usability heuristics for smartphone application,

M. S. Bashir and A. Farooq, “EUHSA: Extending usability heuristics for smartphone application,”IEEE Access, vol. 7, pp. 100 838–100 859, 2019

2019
[15]

GUI information-based interaction logging and visualization for asynchronous usability testing,

J. Jeong, N. Kim, and H. P. In, “GUI information-based interaction logging and visualization for asynchronous usability testing,”Expert Systems with Applications, vol. 151, p. 113289, 2020

2020
[16]

Deep GUI: Black-box GUI input generation with deep learning,

F. YazdaniBanafsheDaragh and S. Malek, “Deep GUI: Black-box GUI input generation with deep learning,” in2021 36th IEEE/ACM Interna- tional Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 905–916

2021
[17]

Humanoid: A deep learning- based approach to automated black-box Android app testing,

Y . Li, Z. Yang, Y . Guo, and X. Chen, “Humanoid: A deep learning- based approach to automated black-box Android app testing,” in2019 34th IEEE/ACM International Conference on Automated Software En- gineering (ASE). IEEE, 2019, pp. 1070–1073

2019
[18]

MUBot: Learning to test large- scale commercial Android apps like a human,

C. Peng, Z. Zhang, Z. Lv, and P. Yang, “MUBot: Learning to test large- scale commercial Android apps like a human,” in2022 IEEE Interna- tional Conference on Software Maintenance and Evolution (ICSME). IEEE, 2022, pp. 543–552

2022
[19]

Reinforcement learning based curiosity-driven testing of Android applications,

M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement learning based curiosity-driven testing of Android applications,” in Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, 2020, pp. 153–164

2020
[20]

Deep reinforce- ment learning for black-box testing of Android apps,

A. Romdhana, A. Merlo, M. Ceccato, and P. Tonella, “Deep reinforce- ment learning for black-box testing of Android apps,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 4, pp. 1–29, 2022

2022
[21]

Chatting with GPT-3 for zero-shot human-like mobile automated GUI testing,

Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, X. Che, D. Wang, and Q. Wang, “Chatting with GPT-3 for zero-shot human-like mobile automated GUI testing,”arXiv preprint arXiv:2305.09434, 2023

work page arXiv 2023
[22]

Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,

Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, Z. Tian, Y . Huang, J. Hu, and Q. Wang, “Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,” inProceedings of the IEEE/ACM 46th International conference on software engineering, 2024, pp. 1–12

2024
[23]

Droidbot-gpt: Gpt-powered ui automation for android,

H. Wen, H. Wang, J. Liu, and Y . Li, “Droidbot-gpt: Gpt-powered ui automation for android,”arXiv preprint arXiv:2304.07061, 2023

work page arXiv 2023
[24]

Ultralytics YOLO11,

G. Jocher and J. Qiu, “Ultralytics YOLO11,” https://github.com/ultraly tics/ultralytics, 2024, version 11.0.0. APPENDIX This appendix provides additional implementation details, worked prompting examples, and representative defect cases used in our study. A. Examples of Collected Defective Applications Table IV lists example defective applications collect...

2024