Large Language Models for Mobile GUI Text Input Generation: An Empirical Study

Chenhui Cui; Chunyang Chen; Dave Towey; Junjie Wang; Rubing Huang; Tao Li

arxiv: 2404.08948 · v4 · pith:VK4TDFMKnew · submitted 2024-04-13 · 💻 cs.SE

classification: cs.SE

keywords: context, input, inputs, models, text, across, apps, bug-detection

Mobile apps have become essential, making quality assurance increasingly important. GUI testing is widely used for automated exploration, yet text-input components remain a major obstacle, as many UI pages require semantically appropriate text inputs before proceeding. Large Language Models (LLMs) have shown promise in generating context-aware text, but the effectiveness of different UI representations, feedback mechanisms, and human intervention remains unclear. This paper presents a large-scale empirical study addressing these gaps. We evaluate nine state-of-the-art LLMs across 115 real-world apps, comparing three UI-context prompting settings: extracted textual context, UI-hierarchy XML, and screenshot-based vision input. Results show extracted context and XML achieve comparable page-pass-through rates (PPTRs) of 71.4% and 71.0%, while vision-based input reaches 65.1% but incurs substantially higher token costs. In bug-detection experiments with 37 real-world text-input bugs, LLMs generating invalid inputs detect about 51% of issues across all evaluated models. A feedback-enhanced protocol, incorporating execution outcomes into subsequent attempts, improves average PPTRs to 69.2-73.8% and raises bug-detection rates to 51.0-64.5%. Human testers further refine inputs, yielding additional gains. We integrate the process into DroidBot, augmenting its UI-exploration capabilities. We derive actionable insights on context selection, cost-effectiveness, feedback strategies, and human-LLM collaboration, advancing both knowledge and practice in Android testing.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
cs.AI 2026-06 unverdicted novelty 6.0

Introduces UXBench benchmark for MLLM UI UX reasoning and UI-UX model achieving 0.7963 accuracy via RL enhancements on Qwen3-VL base.

Improving Random Testing via LLM-powered UI Tarpit Escaping for Mobile Apps
cs.SE 2026-04 conditional novelty 6.0

LLM-powered monitoring of UI similarity allows random testing tools to escape tarpits, yielding 45-55% higher coverage and more unique bugs across 12 apps.