Recognition: unknown
Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3
The pith
A proactive framework using multimodal large language models detects GUI display defects in multi-window mobile scenarios more effectively than passive methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that their framework, which proactively triggers multi-window states such as split-screen and foldable modes, aligns screenshots to widgets using Set-of-Mark, and applies chain-of-thought prompting to multimodal large language models, can detect, localize, and explain GUI display defects. They support this with a benchmark built from 50 real-world Android apps, showing that multi-window conditions increase defect exposure and that the approach yields better results than OwlEye and YOLO-based baselines.
What carries the argument
The combination of proactive state triggering during app exploration, Set-of-Mark visual marking for widget alignment, and chain-of-thought reasoning in multimodal large language models, which together enable interpretation of complex multi-window interfaces.
If this is right
- Multi-window settings make layout defects such as text truncation far more common than in full-screen operation.
- The method can identify a large number of defect-prone applications with low rates of false positives and false negatives.
- It achieves higher accuracy than existing tools for both app-level and fine-grained widget-level defect detection.
- The new benchmark of 50 Android apps enables systematic study of defects in dynamic multi-window environments.
Where Pith is reading between the lines
- Testing teams could run this kind of proactive exploration as part of continuous integration to catch issues early in development.
- The reliance on current multimodal models suggests that improvements in those models would directly improve defect detection reliability.
- Similar techniques might help detect related problems like inconsistent behavior across different device orientations or screen sizes.
- App designers may need to prioritize layout flexibility even more when targeting modern multitasking features.
Load-bearing premise
The multimodal large language models will produce accurate detections and explanations from the marked screenshots without frequent hallucinations or consistent blind spots.
What would settle it
A controlled test where the framework is applied to a collection of apps known to have or lack specific multi-window defects, then measuring whether its detections match the known ground truth at both app and widget levels.
Figures
read the original abstract
Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings.We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We also construct a benchmark of GUI display defects using 50 real-world Android applications.Experimental results show that multi-window settings substantially increase the exposure of layout-related defects, with text truncation increasing by 184% compared with conventional full-screen settings. At the application level, our method detects 40 defect-prone apps with a false positive rate of 10.00% and a false negative rate of 11.11%, outperforming OwlEye and YOLO-based baselines. At the fine-grained level, it achieves the best F1 score of 87.2% for widget occlusion detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an end-to-end proactive framework for detecting GUI display defects (e.g., occlusion, truncation) in multi-window mobile scenarios. It triggers split-screen/foldable/window-transition states during exploration, aligns screenshots via Set-of-Mark (SoM) with widget elements, and uses multimodal LLMs with chain-of-thought prompting to detect, localize, and explain defects. A benchmark is built from 50 real Android apps; results claim multi-window settings increase defects (text truncation +184%), the method flags 40 defect-prone apps (FPR 10%, FNR 11.11%), and achieves 87.2% F1 on fine-grained widget occlusion, outperforming OwlEye and YOLO baselines.
Significance. If validated, the work could meaningfully advance automated GUI testing by shifting from passive post-facto analysis to proactive multi-window exploration and leveraging MLLM reasoning for both detection and explanation. The combination of state triggering, SoM alignment, and CoT prompting addresses a timely gap as foldables and split-screen become common; the reported 184% increase in truncation exposure and strong fine-grained F1 provide concrete evidence of the problem's severity and the method's potential utility.
major comments (3)
- [Evaluation / Experimental Results] Evaluation section (and abstract): The central performance claims (40/50 apps flagged, FPR=10%, FNR=11.11%, F1=87.2% on occlusion) are presented without any description of the ground-truth annotation protocol, number of human annotators, inter-annotator agreement, or explicit hallucination-mitigation steps for the MLLM outputs. Because the benchmark itself is constructed via the same MLLM+CoT pipeline, this omission creates a circularity risk that uncaught systematic errors could inflate all reported metrics.
- [Experimental Results] §4 (or equivalent baseline comparison subsection): The outperformance over OwlEye and YOLO-based baselines is stated quantitatively but without implementation details, version numbers, hyperparameter settings, or whether the baselines were adapted for multi-window inputs. This prevents assessment of whether the gains are due to the proposed framework or to differences in experimental setup.
- [Benchmark Construction] Benchmark construction paragraph: The paper states that 50 apps were used to build the defect benchmark, yet provides no information on how apps were selected, how many screenshots per app were collected, or whether any independent human review was performed to confirm the MLLM-generated labels before computing FPR/FNR/F1. This detail is load-bearing for the reliability of all quantitative results.
minor comments (2)
- [Abstract / Results] The abstract and results text use “multi-window settings substantially increase the exposure of layout-related defects” without citing the exact table or figure that quantifies the 184% truncation increase or providing confidence intervals.
- [Method / Prompt Design] Notation for defect categories (occlusion, truncation, etc.) is introduced without a clear taxonomy or example images in the main text; readers must infer definitions from the MLLM prompt examples.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We appreciate the identification of areas where additional transparency is needed in our evaluation, baselines, and benchmark. We will revise the manuscript to incorporate the requested details while preserving the core contributions.
read point-by-point responses
-
Referee: [Evaluation / Experimental Results] Evaluation section (and abstract): The central performance claims (40/50 apps flagged, FPR=10%, FNR=11.11%, F1=87.2% on occlusion) are presented without any description of the ground-truth annotation protocol, number of human annotators, inter-annotator agreement, or explicit hallucination-mitigation steps for the MLLM outputs. Because the benchmark itself is constructed via the same MLLM+CoT pipeline, this omission creates a circularity risk that uncaught systematic errors could inflate all reported metrics.
Authors: We agree that the manuscript lacks explicit details on the ground-truth protocol, creating a valid concern about circularity. In the revision we will add a dedicated paragraph in the evaluation section describing the full annotation process. This will specify the human annotation protocol, number of annotators, inter-annotator agreement, and hallucination-mitigation steps (including cross-verification against UI hierarchies and consensus requirements). We will also clarify that benchmark labels received independent human validation separate from the automated detection pipeline, thereby addressing the circularity risk. revision: yes
-
Referee: [Experimental Results] §4 (or equivalent baseline comparison subsection): The outperformance over OwlEye and YOLO-based baselines is stated quantitatively but without implementation details, version numbers, hyperparameter settings, or whether the baselines were adapted for multi-window inputs. This prevents assessment of whether the gains are due to the proposed framework or to differences in experimental setup.
Authors: The referee is correct that implementation details for the baselines are absent. We will expand the baseline comparison subsection to include the exact versions of OwlEye and YOLO employed, all hyperparameter settings, and the adaptations made for multi-window inputs (such as per-window processing or screenshot concatenation). This addition will enable readers to verify that performance differences arise from our framework rather than experimental discrepancies. revision: yes
-
Referee: [Benchmark Construction] Benchmark construction paragraph: The paper states that 50 apps were used to build the defect benchmark, yet provides no information on how apps were selected, how many screenshots per app were collected, or whether any independent human review was performed to confirm the MLLM-generated labels before computing FPR/FNR/F1. This detail is load-bearing for the reliability of all quantitative results.
Authors: We acknowledge the omission of these load-bearing details. The revised benchmark construction paragraph will specify the app selection criteria, the number of screenshots collected per app, and the independent human review process used to validate MLLM-generated labels prior to metric computation. These additions will directly support the reliability of the reported FPR, FNR, and F1 scores. revision: yes
Circularity Check
No circularity: empirical results on external apps
full rationale
The paper proposes a framework for GUI defect detection and reports performance metrics (40/50 apps flagged, FPR 10%, FNR 11.11%, F1 87.2% on occlusion) as direct experimental outcomes from running the method on 50 real-world Android applications. No equations, self-definitional constructs, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The benchmark construction and evaluation steps are presented as independent of the core claims by construction, satisfying the default expectation of a non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Multimodal LLMs with chain-of-thought prompting can accurately detect, localize, and explain display defects from screenshots in multi-window scenarios.
Reference graph
Works this paper leans on
-
[1]
Learn about foldables,
Android Developers, “Learn about foldables,” [Online]. Available: https: //developer.android.com/develop/ui/compose/layouts/adaptive/foldables/ learn-about-foldables, 2026, accessed: Mar. 11, 2026
2026
-
[2]
Support multi-window mode,
——, “Support multi-window mode,” [Online]. Available: https://develo per.android.com/develop/ui/views/layout/support-multi-window-mode, 2026, accessed: Mar. 11, 2026
2026
-
[3]
Screen compatibility overview,
——, “Screen compatibility overview,” [Online]. Available: https://de veloper.android.com/guide/practices/screens support, 2026, accessed: Mar. 11, 2026
2026
-
[4]
Sok: An exhaustive taxonomy of display issues for mobile applications,
L. Nie, K. S. Said, and M. Hu, “Sok: An exhaustive taxonomy of display issues for mobile applications,” inProceedings of the 29th International Conference on Intelligent User Interfaces, 2024, pp. 537–548
2024
-
[5]
The metamorphosis: Automatic detection of scaling issues for mobile apps,
Y . Su, C. Chen, J. Wang, Z. Liu, D. Wang, S. Li, and Q. Wang, “The metamorphosis: Automatic detection of scaling issues for mobile apps,” inProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–12
2022
-
[6]
Owl eyes: Spotting UI display issues via visual understanding,
Z. Liu, C. Chen, J. Wang, Y . Huang, J. Hu, and Q. Wang, “Owl eyes: Spotting UI display issues via visual understanding,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 398–409
2020
-
[7]
DroidBot: A lightweight UI- guided test input generator for Android,
Y . Li, Z. Yang, Y . Guo, and X. Chen, “DroidBot: A lightweight UI- guided test input generator for Android,” in2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE- C). IEEE, 2017, pp. 23–26
2017
-
[8]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-Mark prompting unleashes extraordinary visual grounding in GPT-4v,”arXiv preprint arXiv:2310.11441, 2023
work page internal anchor Pith review arXiv 2023
-
[9]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022
2022
-
[10]
Adaptation of traditional usability testing methods for remote testing,
J. Scholtz, “Adaptation of traditional usability testing methods for remote testing,” inProceedings of the 34th Annual Hawaii International Conference on System Sciences. IEEE, 2001, pp. 8 pp.–
2001
-
[11]
A remote usability testing platform for mobile phones,
H. Liang, H. Song, Y . Fu, X. Cai, and Z. Zhang, “A remote usability testing platform for mobile phones,” in2011 IEEE International Confer- ence on Computer Science and Automation Engineering, vol. 2. IEEE, 2011, pp. 312–316
2011
-
[12]
Heuristics for the assessment of interfaces of mobile devices,
O. Machado Neto and M. D. G. Pimentel, “Heuristics for the assessment of interfaces of mobile devices,” inProceedings of the 19th Brazilian Symposium on Multimedia and the Web, 2013, pp. 93–96
2013
-
[13]
Heuristics for evaluating multi-touch gestures in mobile applications,
S. R. Humayoun, P. H. Chotala, M. S. Bashir, and A. Ebert, “Heuristics for evaluating multi-touch gestures in mobile applications,” inPro- ceedings of the 31st International BCS Human Computer Interaction Conference (HCI 2017). BCS Learning & Development, 2017
2017
-
[14]
EUHSA: Extending usability heuristics for smartphone application,
M. S. Bashir and A. Farooq, “EUHSA: Extending usability heuristics for smartphone application,”IEEE Access, vol. 7, pp. 100 838–100 859, 2019
2019
-
[15]
GUI information-based interaction logging and visualization for asynchronous usability testing,
J. Jeong, N. Kim, and H. P. In, “GUI information-based interaction logging and visualization for asynchronous usability testing,”Expert Systems with Applications, vol. 151, p. 113289, 2020
2020
-
[16]
Deep GUI: Black-box GUI input generation with deep learning,
F. YazdaniBanafsheDaragh and S. Malek, “Deep GUI: Black-box GUI input generation with deep learning,” in2021 36th IEEE/ACM Interna- tional Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 905–916
2021
-
[17]
Humanoid: A deep learning- based approach to automated black-box Android app testing,
Y . Li, Z. Yang, Y . Guo, and X. Chen, “Humanoid: A deep learning- based approach to automated black-box Android app testing,” in2019 34th IEEE/ACM International Conference on Automated Software En- gineering (ASE). IEEE, 2019, pp. 1070–1073
2019
-
[18]
MUBot: Learning to test large- scale commercial Android apps like a human,
C. Peng, Z. Zhang, Z. Lv, and P. Yang, “MUBot: Learning to test large- scale commercial Android apps like a human,” in2022 IEEE Interna- tional Conference on Software Maintenance and Evolution (ICSME). IEEE, 2022, pp. 543–552
2022
-
[19]
Reinforcement learning based curiosity-driven testing of Android applications,
M. Pan, A. Huang, G. Wang, T. Zhang, and X. Li, “Reinforcement learning based curiosity-driven testing of Android applications,” in Proceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis, 2020, pp. 153–164
2020
-
[20]
Deep reinforce- ment learning for black-box testing of Android apps,
A. Romdhana, A. Merlo, M. Ceccato, and P. Tonella, “Deep reinforce- ment learning for black-box testing of Android apps,”ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 4, pp. 1–29, 2022
2022
-
[21]
Chatting with GPT-3 for zero-shot human-like mobile automated GUI testing,
Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, X. Che, D. Wang, and Q. Wang, “Chatting with GPT-3 for zero-shot human-like mobile automated GUI testing,”arXiv preprint arXiv:2305.09434, 2023
-
[22]
Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,
Z. Liu, C. Chen, J. Wang, M. Chen, B. Wu, Z. Tian, Y . Huang, J. Hu, and Q. Wang, “Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model,” inProceedings of the IEEE/ACM 46th International conference on software engineering, 2024, pp. 1–12
2024
-
[23]
Droidbot-gpt: Gpt-powered ui automation for android,
H. Wen, H. Wang, J. Liu, and Y . Li, “Droidbot-gpt: Gpt-powered ui automation for android,”arXiv preprint arXiv:2304.07061, 2023
-
[24]
Ultralytics YOLO11,
G. Jocher and J. Qiu, “Ultralytics YOLO11,” https://github.com/ultraly tics/ultralytics, 2024, version 11.0.0. APPENDIX This appendix provides additional implementation details, worked prompting examples, and representative defect cases used in our study. A. Examples of Collected Defective Applications Table IV lists example defective applications collect...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.