pith. machine review for the scientific record.

arxiv: 2604.19081 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3

classification 💻 cs.SE
keywords GUI defect detection · multi-window scenarios · multimodal reasoning · proactive testing · Android applications · widget occlusion · chain-of-thought prompting · Set-of-Mark

The pith

A proactive framework using multimodal large language models detects GUI display defects in multi-window mobile scenarios more effectively than passive methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that GUI defects become much more likely when apps run in split-screen or on foldable devices, and that current detection tools miss many of them because they only look at finished screenshots. The authors build an end-to-end system that forces apps into these multi-window states during exploration, marks the widgets on the screenshots, and then asks multimodal models to reason step by step about possible display problems. This matters because phones and tablets increasingly support multitasking, so testing must keep up with how people actually use them. The method produces both detections at the app level and explanations at the widget level, outperforming earlier approaches on a new collection of real Android applications.
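To make the "reason step by step about possible display problems" stage concrete, here is a minimal sketch of how a chain-of-thought prompt for a marked screenshot might be assembled. The wording, function name, and widget-description schema are hypothetical illustrations, not the paper's actual prompts.

```python
# Illustrative chain-of-thought prompt builder for multi-window GUI defect
# detection. The prompt text and structure are editorial assumptions; the
# paper's real prompts are not reproduced here.

def build_cot_prompt(window_mode, marked_widgets):
    """Compose a step-by-step reasoning prompt for a multimodal model.

    marked_widgets maps Set-of-Mark indices to short widget descriptions
    taken from the view hierarchy.
    """
    widget_lines = "\n".join(
        f"  [{idx}] {desc}" for idx, desc in sorted(marked_widgets.items())
    )
    return (
        f"The screenshot was captured in {window_mode} mode. "
        "Each widget is tagged with a numeric mark:\n"
        f"{widget_lines}\n"
        "Reason step by step:\n"
        "1. For each mark, check whether its content is fully visible.\n"
        "2. Check for truncated text, overlapping widgets, and missing elements.\n"
        "3. Report each defect as: mark index, defect type, explanation.\n"
    )

prompt = build_cot_prompt(
    "split-screen",
    {1: "title TextView", 2: "submit Button", 3: "search EditText"},
)
```

The prompt and the marked image would then be sent together to the multimodal model, whose numbered answers map back onto the marked widgets.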

Core claim

The authors claim that their framework, which proactively triggers multi-window states such as split-screen and foldable modes, aligns screenshots to widgets using Set-of-Mark, and applies chain-of-thought prompting to multimodal large language models, can detect, localize, and explain GUI display defects. They support this with a benchmark built from 50 real-world Android apps, showing that multi-window conditions increase defect exposure and that the approach yields better results than OwlEye and YOLO-based baselines.

What carries the argument

The argument rests on the combination of proactive state triggering during app exploration, Set-of-Mark visual marking for widget alignment, and chain-of-thought reasoning in multimodal large language models; together these enable interpretation of complex multi-window interfaces.
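The Set-of-Mark alignment step can be sketched concretely: widget bounds from an Android view hierarchy (the `[left,top][right,bottom]` bounds string) are parsed and each widget is assigned a numeric mark, so the model's numbered answers can be mapped back to widgets. The helper below is an editorial illustration with a hypothetical widget schema, not the paper's implementation.

```python
import re

def parse_bounds(bounds):
    """Parse an Android view-hierarchy bounds string like '[0,0][540,1200]'."""
    l, t, r, b = map(int, re.findall(r"-?\d+", bounds))
    return l, t, r, b

def assign_marks(widgets):
    """Give each widget a numeric Set-of-Mark label and a label anchor point.

    widgets: list of dicts with 'id' and 'bounds' keys (hypothetical schema).
    Returns {mark_index: (widget_id, (x, y))}, where (x, y) is the top-left
    corner at which the mark would be drawn on the screenshot.
    """
    marks = {}
    for i, w in enumerate(widgets, start=1):
        l, t, _, _ = parse_bounds(w["bounds"])
        marks[i] = (w["id"], (l, t))
    return marks

marks = assign_marks([
    {"id": "btn_ok", "bounds": "[12,40][200,120]"},
    {"id": "txt_title", "bounds": "[0,0][540,36]"},
])
# marks[1] == ("btn_ok", (12, 40))
```

Rendering the numeric labels onto the screenshot at those anchor points yields the marked image the model reasons over.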

If this is right

  • Multi-window settings make layout defects such as text truncation far more common than in full-screen operation.
  • The method can identify a large number of defect-prone applications with low rates of false positives and false negatives.
  • It achieves higher accuracy than existing tools for both app-level and fine-grained widget-level defect detection.
  • The new benchmark of 50 Android apps enables systematic study of defects in dynamic multi-window environments.
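For widget occlusion in particular, a deterministic geometric check on view-hierarchy bounds conveys what the model is asked to judge visually: two widgets whose rectangles overlap beyond a threshold are occlusion candidates. This is an editorial baseline sketch (thresholds and pairing logic are assumptions), not the paper's detector.

```python
def overlap_ratio(a, b):
    """Fraction of rectangle a covered by rectangle b; rects are (l, t, r, b)."""
    l = max(a[0], b[0])
    t = max(a[1], b[1])
    r = min(a[2], b[2])
    bot = min(a[3], b[3])
    if r <= l or bot <= t:
        return 0.0
    inter = (r - l) * (bot - t)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return inter / area_a

def occluded_pairs(rects, threshold=0.3):
    """Flag widget pairs where one rectangle covers > threshold of the other."""
    flags = []
    for i in range(len(rects)):
        for j in range(len(rects)):
            if i != j and overlap_ratio(rects[i], rects[j]) > threshold:
                flags.append((i, j))
    return flags

# A narrow split-screen pane can push widgets on top of each other:
pairs = occluded_pairs([(0, 0, 100, 50), (0, 20, 100, 90), (200, 0, 260, 40)])
# → [(0, 1), (1, 0)]: the first two widgets overlap; the third is clear.
```

A purely geometric check like this misses defects that only show up after rendering (clipped text, invisible overlays), which is where the multimodal visual reasoning earns its keep.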

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Testing teams could run this kind of proactive exploration as part of continuous integration to catch issues early in development.
  • The reliance on current multimodal models suggests that improvements in those models would directly improve defect detection reliability.
  • Similar techniques might help detect related problems like inconsistent behavior across different device orientations or screen sizes.
  • App designers may need to prioritize layout flexibility even more when targeting modern multitasking features.

Load-bearing premise

That the multimodal large language models produce accurate detections and explanations from the marked screenshots, without frequent hallucinations or systematic blind spots.

What would settle it

A controlled test where the framework is applied to a collection of apps known to have or lack specific multi-window defects, then measuring whether its detections match the known ground truth at both app and widget levels.
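Such a test reduces to confusion-matrix bookkeeping at each level. The sketch below shows the metric definitions the paper's reported numbers imply (FPR, FNR, F1); the counts are made-up toy data, not the paper's benchmark.

```python
def rates(flagged, defective, all_items):
    """Compute FPR, FNR, and F1 for a set of flagged items.

    flagged: ids the detector marked defective; defective: ground-truth ids;
    all_items: every id in the benchmark. All inputs are sets.
    """
    tp = len(flagged & defective)
    fp = len(flagged - defective)
    fn = len(defective - flagged)
    tn = len(all_items - flagged - defective)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return fpr, fnr, f1

# Toy example (not the paper's data): 10 apps, 6 truly defective.
apps = set(range(10))
truth = {0, 1, 2, 3, 4, 5}
pred = {0, 1, 2, 3, 4, 6}          # misses app 5, wrongly flags app 6
fpr, fnr, f1 = rates(pred, truth, apps)
# fpr = 1/4, fnr = 1/6, f1 = 10/12
```

Running the same bookkeeping per widget (rather than per app) would yield the fine-grained occlusion F1 the paper reports, provided the ground-truth labels are validated independently of the detection pipeline.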

Figures

Figures reproduced from arXiv: 2604.19081 by Haotian Huang, Jianwen Xiang, Jinhao Cui, Rui Hao, Rui Wang, Wei Xue, Wenhua Hu, Xinyao Zhang.

Figure 1: Examples of amplified display defects in split-screen and foldable …
Figure 2: Framework overview of our approach. Enhanced DroidBot collects screenshots and view hierarchies across multiple window states, and the resulting …
Figure 4: Application-level FPR and FNR under conventional and split/fold …
Figure 5: Representative examples of the major GUI adaptation defect categories …
Figure 6: Examples of normal and mismatched GUI layouts under foldable and …
Figure 7: Workflow of the enhanced DroidBot framework for triggering split …
Figure 8: Example of SoM construction. The left image shows the original …
Figure 9: A worked example of the multimodal chain-of-thought prompting …
original abstract

Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings. We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We also construct a benchmark of GUI display defects using 50 real-world Android applications. Experimental results show that multi-window settings substantially increase the exposure of layout-related defects, with text truncation increasing by 184% compared with conventional full-screen settings. At the application level, our method detects 40 defect-prone apps with a false positive rate of 10.00% and a false negative rate of 11.11%, outperforming OwlEye and YOLO-based baselines. At the fine-grained level, it achieves the best F1 score of 87.2% for widget occlusion detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents an end-to-end proactive framework for detecting GUI display defects (e.g., occlusion, truncation) in multi-window mobile scenarios. It triggers split-screen/foldable/window-transition states during exploration, aligns screenshots via Set-of-Mark (SoM) with widget elements, and uses multimodal LLMs with chain-of-thought prompting to detect, localize, and explain defects. A benchmark is built from 50 real Android apps; results claim multi-window settings increase defects (text truncation +184%), the method flags 40 defect-prone apps (FPR 10%, FNR 11.11%), and achieves 87.2% F1 on fine-grained widget occlusion, outperforming OwlEye and YOLO baselines.

Significance. If validated, the work could meaningfully advance automated GUI testing by shifting from passive post-facto analysis to proactive multi-window exploration and leveraging MLLM reasoning for both detection and explanation. The combination of state triggering, SoM alignment, and CoT prompting addresses a timely gap as foldables and split-screen become common; the reported 184% increase in truncation exposure and strong fine-grained F1 provide concrete evidence of the problem's severity and the method's potential utility.

major comments (3)
  1. [Evaluation / Experimental Results] Evaluation section (and abstract): The central performance claims (40/50 apps flagged, FPR=10%, FNR=11.11%, F1=87.2% on occlusion) are presented without any description of the ground-truth annotation protocol, number of human annotators, inter-annotator agreement, or explicit hallucination-mitigation steps for the MLLM outputs. Because the benchmark itself is constructed via the same MLLM+CoT pipeline, this omission creates a circularity risk that uncaught systematic errors could inflate all reported metrics.
  2. [Experimental Results] §4 (or equivalent baseline comparison subsection): The outperformance over OwlEye and YOLO-based baselines is stated quantitatively but without implementation details, version numbers, hyperparameter settings, or whether the baselines were adapted for multi-window inputs. This prevents assessment of whether the gains are due to the proposed framework or to differences in experimental setup.
  3. [Benchmark Construction] Benchmark construction paragraph: The paper states that 50 apps were used to build the defect benchmark, yet provides no information on how apps were selected, how many screenshots per app were collected, or whether any independent human review was performed to confirm the MLLM-generated labels before computing FPR/FNR/F1. This detail is load-bearing for the reliability of all quantitative results.
minor comments (2)
  1. [Abstract / Results] The abstract and results text use “multi-window settings substantially increase the exposure of layout-related defects” without citing the exact table or figure that quantifies the 184% truncation increase or providing confidence intervals.
  2. [Method / Prompt Design] Notation for defect categories (occlusion, truncation, etc.) is introduced without a clear taxonomy or example images in the main text; readers must infer definitions from the MLLM prompt examples.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We appreciate the identification of areas where additional transparency is needed in our evaluation, baselines, and benchmark. We will revise the manuscript to incorporate the requested details while preserving the core contributions.

point-by-point responses
  1. Referee: [Evaluation / Experimental Results] Evaluation section (and abstract): The central performance claims (40/50 apps flagged, FPR=10%, FNR=11.11%, F1=87.2% on occlusion) are presented without any description of the ground-truth annotation protocol, number of human annotators, inter-annotator agreement, or explicit hallucination-mitigation steps for the MLLM outputs. Because the benchmark itself is constructed via the same MLLM+CoT pipeline, this omission creates a circularity risk that uncaught systematic errors could inflate all reported metrics.

    Authors: We agree that the manuscript lacks explicit details on the ground-truth protocol, creating a valid concern about circularity. In the revision we will add a dedicated paragraph in the evaluation section describing the full annotation process. This will specify the human annotation protocol, number of annotators, inter-annotator agreement, and hallucination-mitigation steps (including cross-verification against UI hierarchies and consensus requirements). We will also clarify that benchmark labels received independent human validation separate from the automated detection pipeline, thereby addressing the circularity risk. revision: yes

  2. Referee: [Experimental Results] §4 (or equivalent baseline comparison subsection): The outperformance over OwlEye and YOLO-based baselines is stated quantitatively but without implementation details, version numbers, hyperparameter settings, or whether the baselines were adapted for multi-window inputs. This prevents assessment of whether the gains are due to the proposed framework or to differences in experimental setup.

    Authors: The referee is correct that implementation details for the baselines are absent. We will expand the baseline comparison subsection to include the exact versions of OwlEye and YOLO employed, all hyperparameter settings, and the adaptations made for multi-window inputs (such as per-window processing or screenshot concatenation). This addition will enable readers to verify that performance differences arise from our framework rather than experimental discrepancies. revision: yes

  3. Referee: [Benchmark Construction] Benchmark construction paragraph: The paper states that 50 apps were used to build the defect benchmark, yet provides no information on how apps were selected, how many screenshots per app were collected, or whether any independent human review was performed to confirm the MLLM-generated labels before computing FPR/FNR/F1. This detail is load-bearing for the reliability of all quantitative results.

    Authors: We acknowledge the omission of these load-bearing details. The revised benchmark construction paragraph will specify the app selection criteria, the number of screenshots collected per app, and the independent human review process used to validate MLLM-generated labels prior to metric computation. These additions will directly support the reliability of the reported FPR, FNR, and F1 scores. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external apps

full rationale

The paper proposes a framework for GUI defect detection and reports performance metrics (40/50 apps flagged, FPR 10%, FNR 11.11%, F1 87.2% on occlusion) as direct experimental outcomes from running the method on 50 real-world Android applications. No equations, self-definitional constructs, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The benchmark construction and evaluation steps are presented as independent of the core claims by construction, satisfying the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the unverified assumption that current multimodal LLMs can perform reliable visual reasoning about GUI defects; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Multimodal LLMs with chain-of-thought prompting can accurately detect, localize, and explain display defects from screenshots in multi-window scenarios.
    Central to the detection and explanation step; no independent validation supplied.

pith-pipeline@v0.9.0 · 5562 in / 1328 out tokens · 71816 ms · 2026-05-10T02:58:46.357743+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Learn about foldables

    Android Developers, “Learn about foldables,” [Online]. Available: https://developer.android.com/develop/ui/compose/layouts/adaptive/foldables/learn-about-foldables, 2026, accessed: Mar. 11, 2026.

  2. [2]

    Support multi-window mode

    Android Developers, “Support multi-window mode,” [Online]. Available: https://developer.android.com/develop/ui/views/layout/support-multi-window-mode, 2026, accessed: Mar. 11, 2026.

  3. [3]

    Screen compatibility overview

    Android Developers, “Screen compatibility overview,” [Online]. Available: https://developer.android.com/guide/practices/screens_support, 2026, accessed: Mar. 11, 2026.

  4. [4]

    SoK: An exhaustive taxonomy of display issues for mobile applications

    L. Nie, K. S. Said, and M. Hu, “SoK: An exhaustive taxonomy of display issues for mobile applications,” in Proceedings of the 29th International Conference on Intelligent User Interfaces, 2024, pp. 537–548.

  5. [5]

    The metamorphosis: Automatic detection of scaling issues for mobile apps

    Y. Su, C. Chen, J. Wang, Z. Liu, D. Wang, S. Li, and Q. Wang, “The metamorphosis: Automatic detection of scaling issues for mobile apps,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–12.

  6. [6]

    Owl eyes: Spotting UI display issues via visual understanding

    Z. Liu, C. Chen, J. Wang, Y. Huang, J. Hu, and Q. Wang, “Owl eyes: Spotting UI display issues via visual understanding,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 398–409.

  7. [7]

    DroidBot: A lightweight UI-guided test input generator for Android

    Y. Li, Z. Yang, Y. Guo, and X. Chen, “DroidBot: A lightweight UI-guided test input generator for Android,” in 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). IEEE, 2017, pp. 23–26.

  8. [8]

    Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V

    J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4V,” arXiv preprint arXiv:2310.11441, 2023.

  9. [9]

    Chain-of-thought prompting elicits reasoning in large language models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.

  10. [10]

    Adaptation of traditional usability testing methods for remote testing

    J. Scholtz, “Adaptation of traditional usability testing methods for remote testing,” in Proceedings of the 34th Annual Hawaii International Conference on System Sciences. IEEE, 2001, 8 pp.

  11. [11]

    A remote usability testing platform for mobile phones

    H. Liang, H. Song, Y. Fu, X. Cai, and Z. Zhang, “A remote usability testing platform for mobile phones,” in 2011 IEEE International Conference on Computer Science and Automation Engineering, vol. 2. IEEE, 2011, pp. 312–316.

  12. [12]

    Heuristics for the assessment of interfaces of mobile devices

    O. Machado Neto and M. D. G. Pimentel, “Heuristics for the assessment of interfaces of mobile devices,” in Proceedings of the 19th Brazilian Symposium on Multimedia and the Web, 2013, pp. 93–96.

  13. [13]

    Heuristics for evaluating multi-touch gestures in mobile applications

    S. R. Humayoun, P. H. Chotala, M. S. Bashir, and A. Ebert, “Heuristics for evaluating multi-touch gestures in mobile applications,” in Proceedings of the 31st International BCS Human Computer Interaction Conference (HCI 2017). BCS Learning & Development, 2017.

  14. [14]

    EUHSA: Extending usability heuristics for smartphone application

    M. S. Bashir and A. Farooq, “EUHSA: Extending usability heuristics for smartphone application,” IEEE Access, vol. 7, pp. 100838–100859, 2019.

  15. [15]

    GUI information-based interaction logging and visualization for asynchronous usability testing

    J. Jeong, N. Kim, and H. P. In, “GUI information-based interaction logging and visualization for asynchronous usability testing,” Expert Systems with Applications, vol. 151, p. 113289, 2020.

  16. [16]

    Deep GUI: Black-box GUI input generation with deep learning

    F. YazdaniBanafsheDaragh and S. Malek, “Deep GUI: Black-box GUI input generation with deep learning,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 905–916.

  17. [17]

    Humanoid: A deep learning-based approach to automated black-box Android app testing

    Y. Li, Z. Yang, Y. Guo, and X. Chen, “Humanoid: A deep learning-based approach to automated black-box Android app testing,” in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2019, pp. 1070–1073.

  18. [18]

    MUBot: Learning to test large-scale commercial Android apps like a human

    C. Peng, Z. Zhang, Z. Lv, and P. Yang, “MUBot: Learning to test large-scale commercial Android apps like a human,” in 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2022, pp. 543–552.

  19. [19]

    Reinforcement learning based curiosity-driven testing of Android applications

    M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement learning based curiosity-driven testing of Android applications,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2020, pp. 153–164.

  20. [20]

    Deep reinforcement learning for black-box testing of Android apps

    A. Romdhana, A. Merlo, M. Ceccato, and P. Tonella, “Deep reinforcement learning for black-box testing of Android apps,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 4, pp. 1–29, 2022.

  21. [21]

    Chatting with GPT-3 for zero-shot human-like mobile automated GUI testing

    Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, X. Che, D. Wang, and Q. Wang, “Chatting with GPT-3 for zero-shot human-like mobile automated GUI testing,” arXiv preprint arXiv:2305.09434, 2023.

  22. [22]

    Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model

    Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, Z. Tian, Y. Huang, J. Hu, and Q. Wang, “Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–12.

  23. [23]

    DroidBot-GPT: GPT-powered UI automation for Android

    H. Wen, H. Wang, J. Liu, and Y. Li, “DroidBot-GPT: GPT-powered UI automation for Android,” arXiv preprint arXiv:2304.07061, 2023.

  24. [24]

    Ultralytics YOLO11

    G. Jocher and J. Qiu, “Ultralytics YOLO11,” https://github.com/ultralytics/ultralytics, 2024, version 11.0.0.