Large Language Models for Mobile GUI Text Input Generation: An Empirical Study
Mobile apps have become essential, making quality assurance increasingly important. GUI testing is widely used for automated exploration, yet text-input components remain a major obstacle, as many UI pages require semantically appropriate text inputs before proceeding. Large Language Models (LLMs) have shown promise in generating context-aware text, but the effectiveness of different UI representations, feedback mechanisms, and human intervention remains unclear. This paper presents a large-scale empirical study addressing these gaps. We evaluate nine state-of-the-art LLMs across 115 real-world apps, comparing three UI-context prompting settings: extracted textual context, UI-hierarchy XML, and screenshot-based vision input. Results show that extracted context and XML achieve comparable page-pass-through rates (PPTRs) of 71.4% and 71.0%, while vision-based input reaches 65.1% but incurs substantially higher token costs. In bug-detection experiments with 37 real-world text-input bugs, invalid inputs generated by the LLMs detect about 51% of issues across all evaluated models. A feedback-enhanced protocol, which incorporates execution outcomes into subsequent attempts, improves average PPTRs to 69.2-73.8% and raises bug-detection rates to 51.0-64.5%. Human testers further refine inputs, yielding additional gains. We integrate the process into DroidBot, augmenting its UI-exploration capabilities. We derive actionable insights on context selection, cost-effectiveness, feedback strategies, and human-LLM collaboration, advancing both knowledge and practice in Android testing.
Forward citations
Cited by 2 Pith papers
- Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
  Introduces UXBench benchmark for MLLM UI UX reasoning and UI-UX model achieving 0.7963 accuracy via RL enhancements on Qwen3-VL base.
- Improving Random Testing via LLM-powered UI Tarpit Escaping for Mobile Apps
  LLM-powered monitoring of UI similarity allows random testing tools to escape tarpits, yielding 45-55% higher coverage and more unique bugs across 12 apps.