pith · machine review for the scientific record

arxiv: 2604.00458 · v2 · submitted 2026-04-01 · 💻 cs.SE

Recognition: 1 Lean theorem link

LDMDroid: Leveraging LLMs for Detecting Data Manipulation Errors in Android Apps

Huaxun Huang, Rongxin Wu, Xiangyang Xiao

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:50 UTC · model grok-4.3

classification 💻 cs.SE
keywords Android apps · data manipulation errors · large language models · UI testing · automated bug detection · logic bugs · data state changes

The pith

Large language models detect data manipulation errors in Android apps by guiding state-aware UI sequences and using visual checks for data changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Android apps rely on data manipulation functionalities to handle data through create, read, update, and delete operations, so errors in these operations directly undermine app reliability. The paper shows that large language models, guided through a state-aware process, can generate UI interaction sequences that trigger these functionalities more effectively than existing automated tools. By also using visual features to detect changes in data states, the method verifies whether the expected data manipulations occurred, reducing reliance on manually written test scripts. Tested on 24 real-world apps, the approach found 17 unique bugs, most of them confirmed and fixed by developers, demonstrating its practical value in catching hard-to-find logic bugs.

Core claim

LDMDroid improves the success rate of triggering data manipulation functionalities (DMFs) by guiding LLMs through a state-aware process for generating UI event sequences. It also uses visual features to identify changes in data states, improving the accuracy of data manipulation error (DME) verification. Evaluated on 24 real-world Android apps, it achieves higher DMF triggering success rates than baselines and discovered 17 unique bugs, 14 of which developers confirmed and 11 fixed.

What carries the argument

The state-aware process for LLM-guided UI event sequence generation, paired with visual feature analysis for data state change detection, which together enable automated detection of data manipulation errors.
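The paper does not spell out this loop at code level here; as a rough sketch of the kind of state-aware exploration the pith describes, the following stubs out both load-bearing pieces. Every name is hypothetical: `propose_event` stands in for the prompted LLM, and `visual_diff` reduces the paper's visual-feature comparison to a set difference over rendered items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIState:
    """Coarse, hashable abstraction of a screen: the widgets it exposes."""
    widgets: tuple


def propose_event(state, history):
    """Stand-in for the LLM call: pick the first widget not yet exercised
    in this state. The real system would prompt a model with a state summary
    and ask for the event most likely to reach a data manipulation feature."""
    tried = {e for (s, e) in history if s == state}
    for w in state.widgets:
        if w not in tried:
            return w
    return None


def visual_diff(before, after):
    """Stand-in for visual-feature comparison: flag any change in the
    rendered data items. The paper compares visual features of screenshots."""
    return set(before) != set(after)


def explore(app, max_steps=20):
    """State-aware loop: remember (state, event) pairs already exercised so
    the guide only proposes new behavior, and record events whose execution
    visibly changed the displayed data state."""
    history, changes = [], []
    for _ in range(max_steps):
        state = app.current_state()
        event = propose_event(state, history)
        if event is None:
            break  # every event in this state has been exercised
        before = app.rendered_items()
        app.perform(event)
        after = app.rendered_items()
        history.append((state, event))
        if visual_diff(before, after):
            changes.append(event)
    return changes


class FakeNotesApp:
    """Toy stand-in app: 'add' creates a note, 'about' changes no data."""
    def __init__(self):
        self.notes = []

    def current_state(self):
        return UIState(widgets=("add", "about"))

    def rendered_items(self):
        return list(self.notes)

    def perform(self, event):
        if event == "add":
            self.notes.append(f"note{len(self.notes)}")
```

On the toy app, `explore(FakeNotesApp())` exercises both events once and reports only `"add"` as a data-state change, which is the shape of signal the DME oracle would then check against the expected CRUD outcome.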

Load-bearing premise

That LLMs guided in a state-aware manner will generate UI sequences capable of reliably triggering data manipulation functionalities, and that visual features will accurately reflect data state changes without significant error.

What would settle it

Independently re-running the tool on the evaluated apps: the claim would fail if the reported bugs proved non-reproducible, or if developers rejected them as non-issues on review.

Figures

Figures reproduced from arXiv: 2604.00458 by Huaxun Huang, Rongxin Wu, Xiangyang Xiao.

Figure 1. A DME in Material Files (v1.7.4) related to the “Create File” DMF. The small red boxes indicate UI events. The expected outcome (f) shows the newly created file appearing in the file list, while the actual outcome (e) reveals its absence, demonstrating the DME.
Figure 2. The overview of LDMDroid.
Figure 3. Illustrative examples of LDMDroid detections. (a, b) show a true positive: a deleted habit item remains visible in the list.
Original abstract

Android apps rely heavily on Data Manipulation Functionalities (DMFs) for handling app-specific data through CRUDS operations, making their correctness vital for reliability. However, detecting Data Manipulation Errors (DMEs) is challenging due to their dependence on specific UI interaction sequences and manifestation as logic bugs. Existing automated UI testing tools face two primary challenges: insufficient UI path coverage for adequate DMF triggering and reliance on manually written test scripts. To address these issues, we propose an automated approach using Large Language Models (LLMs) for DME detection. We developed LDMDroid, an automated UI testing framework for Android apps. LDMDroid enhances DMF triggering success by guiding LLMs through a state-aware process for generating UI event sequences. It also uses visual features to identify changes in data states, improving DME verification accuracy. We evaluated LDMDroid on 24 real-world Android apps, demonstrating improved DMF triggering success rates compared to baselines. LDMDroid discovered 17 unique bugs, with 14 confirmed by developers and 11 fixed. The tool is publicly available at https://github.com/runnnnnner200/LDMDroid.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LDMDroid, an automated UI testing framework that leverages LLMs in a state-aware process to generate UI event sequences for triggering Data Manipulation Functionalities (DMFs) in Android apps, employs visual features to detect data state changes for verifying Data Manipulation Errors (DMEs), and evaluates the approach on 24 real-world apps, reporting improved DMF triggering rates over baselines along with discovery of 17 unique bugs (14 developer-confirmed, 11 fixed).

Significance. If the central claims hold once the validation gaps are addressed, the work contributes a practical LLM-guided approach to automatically detecting logic bugs in mobile apps that depend on specific UI sequences. Its strengths are the real-world evaluation, the developer confirmations, and the public tool release, which support reproducibility and potential adoption in software engineering practice.

major comments (2)
  1. [Evaluation] Evaluation section: the visual feature-based method for identifying data state changes (e.g., screen differences signaling CRUDS outcomes) lacks any reported precision/recall metrics or comparison against ground-truth techniques such as direct SQLite queries or file diffs. This assumption is load-bearing for the reliability of the 17 reported DMEs and their confirmation counts, as non-data UI updates or animations could produce false signals.
  2. [Abstract and Evaluation] Abstract and §4 (Evaluation): no quantitative details are supplied on the exact baselines, triggering success rate metrics, number of runs per app, or controls for LLM output variability (e.g., temperature, prompt sensitivity). These omissions leave the effectiveness claims only moderately supported despite the reported bug discoveries.
minor comments (2)
  1. [Methodology] Methodology: provide more explicit description of the state representation passed to the LLM and the exact prompting template used for sequence generation to improve reproducibility.
  2. [Evaluation] Ensure all tables reporting bug counts and confirmation status include clear definitions of 'unique' bugs and the criteria used for developer confirmation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the quantitative rigor of our evaluation, and we have revised the manuscript accordingly to address them directly. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the visual feature-based method for identifying data state changes (e.g., screen differences signaling CRUDS outcomes) lacks any reported precision/recall metrics or comparison against ground-truth techniques such as direct SQLite queries or file diffs. This assumption is load-bearing for the reliability of the 17 reported DMEs and their confirmation counts, as non-data UI updates or animations could produce false signals.

    Authors: We agree that explicit validation metrics would further strengthen confidence in the visual detection component. In the revised manuscript we have added a dedicated validation paragraph in §4 that describes a post-hoc manual verification process: for a representative sample of test executions drawn from multiple apps we compared the visual signals against ground-truth SQLite queries and file-system diffs. The results of this comparison are reported in the revised version and show strong alignment, with discrepancies primarily attributable to transient UI animations that we now explicitly filter. All 17 reported DMEs were additionally validated through direct developer confirmation, providing an independent check against false positives. We have also clarified the visual feature extraction logic to better distinguish data-related state changes from other UI updates. revision: yes
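The validation the rebuttal describes, comparing visual signals against ground-truth SQLite queries, can be sketched as a small harness. Everything below is illustrative rather than taken from the paper: the flag lists, table name, and helper names are hypothetical, and the real system would derive the visual flags from screenshots rather than receiving booleans.

```python
import sqlite3


def precision_recall(visual_flags, truth_flags):
    """Compare per-step visual change signals against ground-truth flags
    (e.g. from a SQLite row-count diff or a file-system diff per step)."""
    tp = sum(v and t for v, t in zip(visual_flags, truth_flags))
    fp = sum(v and not t for v, t in zip(visual_flags, truth_flags))
    fn = sum(t and not v for v, t in zip(visual_flags, truth_flags))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def db_row_count(con, table):
    """Ground-truth probe: count rows in the app's backing table before and
    after an event; a changed count implies a real data-state change."""
    (n,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n
```

Sampling `db_row_count` around each executed event yields the `truth_flags` sequence, so the visual detector's precision and recall can be reported directly, which is exactly the metric the referee asked for.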

  2. Referee: [Abstract and Evaluation] Abstract and §4 (Evaluation): no quantitative details are supplied on the exact baselines, triggering success rate metrics, number of runs per app, or controls for LLM output variability (e.g., temperature, prompt sensitivity). These omissions leave the effectiveness claims only moderately supported despite the reported bug discoveries.

    Authors: We accept that additional quantitative detail improves transparency. The revised manuscript expands §4 (and updates the abstract) to specify: the exact baseline tools and their configurations, triggering success rates together with measures of variability across runs, the number of independent runs executed per app, and the LLM hyperparameters (including temperature and prompt sensitivity controls) with the full prompt templates now provided in the appendix. These additions make the reported improvements in DMF triggering and the 17 discovered bugs more quantitatively grounded while preserving the focus on real-world developer-confirmed results. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical tool evaluation

full rationale

The paper proposes LDMDroid as an LLM-guided UI testing framework for detecting data manipulation errors, evaluated empirically on 24 real-world Android apps. It reports improved triggering rates, the discovery of 17 unique bugs, and external developer confirmations (14 confirmed, 11 fixed), with no mathematical derivations, equations, fitted parameters renamed as predictions, or self-citations forming the load-bearing justification for the central claims. The approach relies on external benchmarks and real-app validation rather than reducing results to self-defined inputs or internal loops, so the evaluation is externally grounded rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on assumptions about LLM reliability for sequence generation and visual accuracy for state detection, which are treated as domain assumptions without independent formal proof or external benchmarks beyond the reported evaluation.

axioms (2)
  • domain assumption LLMs guided by state-aware processes can generate effective UI event sequences to trigger DMFs
    Central to enhancing triggering success as stated in the approach description
  • domain assumption Visual features extracted from screen changes can accurately identify data state modifications for DME verification
    Used to improve verification accuracy without manual scripts

pith-pipeline@v0.9.0 · 5501 in / 1245 out tokens · 31847 ms · 2026-05-13T22:50:04.183634+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

[1] Android. 2025. Monkey. https://developer.android.google.cn/studio/test/other-testing-tools/monkey
[2] David Curry. 2025. Android Statistics (2025). https://www.businessofapps.com/data/android-statistics
[3] F-Droid. 2025. F-Droid. https://f-droid.org
[4] GitHub. 2025. Another Notes. https://github.com/maltaisn/another-notes-app
[5] GitHub. 2025. [Bug] Settings Entry Mistakenly Treated as TO-DO List After Deleting Last TO-DO. https://github.com/SecUSo/privacy-friendly-todo-list/issues/158
[6] GitHub. 2025. CPU Info. https://github.com/kamgurgul/cpu-info
[7] GitHub. 2025. CycleStreets. https://github.com/cyclestreets/android
[8] GitHub. 2025. Easy Notes. https://github.com/Kin69/EasyNotes
[9] GitHub. 2025. Fridgey. https://github.com/NielsLee/FoodRecords
[10] GitHub. 2025. Home Medkit. https://github.com/pewaru-333/HomeMedkit-App
[11] GitHub. 2025. LinkHub. https://github.com/AmrDeveloper/LinkHub
[12] GitHub. 2025. Material Notes. https://github.com/maelchiotti/LocalMaterialNotes
[13] GitHub. 2025. MaterialFiles. https://github.com/zhanghai/MaterialFiles
[14] GitHub. 2025. NoNonsense Notes. https://github.com/spacecowboy/NotePad
[15] GitHub. 2025. Notally. https://github.com/OmGodse/Notally
[16] GitHub. 2025. NotallyX. https://github.com/PhilKes/NotallyX
[17] GitHub. 2025. OsmAnd. https://github.com/osmandapp/Osmand
[18] GitHub. 2025. PFA Todo List. https://github.com/SecUSo/privacy-friendly-todo-list
[19] GitHub. 2025. Photo Editor. https://github.com/burhanrashid52/PhotoEditor
[20] GitHub. 2025. Play NotePad. https://github.com/mshdabiola/NotePad
[21] GitHub. 2025. Print Notes. https://github.com/RoBoT095/printnotes
[22] GitHub. 2025. Quillpad. https://github.com/quillpad/quillpad
[23] GitHub. 2025. Rank My Favs. https://github.com/dessalines/rank-my-favs
[24] GitHub. 2025. Read You. https://github.com/Ashinch/ReadYou
[25] GitHub. 2025. Recurring Expense Tracker. https://github.com/DennisBauer/RecurringExpenseTracker
[26] GitHub. 2025. Table Habit. https://github.com/FriesI23/mhabit
[27] GitHub. 2025. Tasky. https://github.com/thatsmanmeet/Tasky
[28] GitHub. 2025. To Don't. https://github.com/Crazy-Marvin/ToDont
[29] Google. 2025. Prompt Template in LDMDroid. https://docs.google.com/document/d/1asyf25458YNXkfM_4pQOaCEPBZj78lcCBQi7lcyxko8
[30] Siyi Gu, Xiaoqiang Liu, Hui Guo, Bochun Cao, Baiyan Li, Lizhi Cai, and Hu Yun. 2024. Finding Deep-Hidden Bugs in Android Apps via Functional Semantics Guided Exploration. In Theoretical Aspects of Software Engineering: 18th International Symposium, TASE 2024, Guiyang, China, July 29 – ...
[31] Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Chun Cao, Chang Xu, Yuan Yao, Qirun Zhang, Jian Lu, and Zhendong Su. 2019. Practical GUI Testing of Android Applications via Model Abstraction and Refinement. In Proceedings of the 41st International Conference on Software Engineering (ICSE '19). IEEE Press, Montreal, Quebec, Canada, 269–280. doi:10.1109/ICSE.2019.00042
[32] Pingfan Kong, Li Li, Jun Gao, Kui Liu, Tegawendé F. Bissyandé, and Jacques Klein. 2019. Automated Testing of Android Apps: A Systematic Literature Review. IEEE Transactions on Reliability 68, 1 (March 2019), 45–66. doi:10.1109/TR.2018.2865733
[33] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. DroidBot: A Lightweight UI-guided Test Input Generator for Android. In Proceedings of the 39th International Conference on Software Engineering Companion (ICSE-C '17). IEEE Press, Buenos Aires, Argentina, 23–26. doi:10.1109/ICSE-C.2017.8
[34] Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2020. Humanoid: A Deep Learning-Based Approach to Automated Black-Box Android App Testing. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE '19). IEEE Press, San Diego, California, 1070–1073. doi:10.1109/ASE.2019.00104
[35] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ACM, Lisbon, Portugal, 1–13. arXiv:2310.15780 doi:10...
[36] Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Zhilin Tian, Yuekai Huang, Jun Hu, and Qing Wang. 2024. Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE '24). Association for Computing Machinery...
[37] Zhengwei Lv, Chao Peng, Zhao Zhang, Ting Su, Kai Liu, and Ping Yang. 2023. Fastbot2: Reusable Automated Model-based GUI Testing for Android Enhanced by Reinforcement Learning. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE '22). Association for Computing Machinery, New York, NY, USA, 1–5. doi:10.1145/355...
[38] Leonardo Mariani, Mauro Pezzè, and Daniele Zuddas. 2018. Augusto: Exploiting Popular Functionalities for the Generation of Semantic GUI Tests with Oracles. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). Association for Computing Machinery, New York, NY, USA, 280–290. doi:10.1145/3180155.3180162
[39] Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li. 2020. Reinforcement Learning Based Curiosity-Driven Testing of Android Applications. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2020). Association for Computing Machinery, New York, NY, USA, 153–164. doi:10.1145/3395363.3397354
[40] Dezhi Ran, Hao Wang, Zihe Song, Mengzhou Wu, Yuan Cao, Ying Zhang, Wei Yang, and Tao Xie. 2024. Guardian: A Runtime Framework for LLM-Based UI Exploration. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024). Association for Computing Machinery, New York, NY, USA, 958–970. doi:10.1145/3650212.3680334
[41] Iflaah Salman, Ayse Tosun Misirli, and Natalia Juristo. 2015. Are Students Representatives of Professionals in Software Engineering Experiments? In Proceedings of the 37th International Conference on Software Engineering - Volume 1 (ICSE '15). IEEE Press, Florence, Italy, 666–676.
[42] Yunpeng Song, Yiheng Bian, Yongtao Tang, Guiyu Ma, and Zhongmin Cai. 2024. VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. ACM, Pittsburgh, PA, USA, 1–17. arXiv:2312.11190 doi:10.1145/3654777.3676386
[43] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, Stochastic Model-Based GUI Testing of Android Apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). Association for Computing Machinery, New York, NY, USA, 245–256. doi:10.1145/3106237.3106298
[44] Ting Su, Yichen Yan, Jue Wang, Jingling Sun, Yiheng Xiong, Geguang Pu, Ke Wang, and Zhendong Su. 2021. Fully Automated Functional Fuzzing of Android Apps for Detecting Non-Crashing Logic Bugs. Proc. ACM Program. Lang. 5, OOPSLA (Oct. 2021), 156:1–156:31. doi:10.1145/3485533
[45] Jingling Sun, Ting Su, Jiayi Jiang, Jue Wang, Geguang Pu, and Zhendong Su. 2023. Property-Based Fuzzing for Finding Data Manipulation Errors in Android Apps. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for Computing Machinery, New York, N...
[46] Bryan Wang, Gang Li, and Yang Li. 2023. Enabling Conversational Interaction with Mobile UI Using Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/3544548.3580895
[47] Jue Wang, Yanyan Jiang, Ting Su, Shaohua Li, Chang Xu, Jian Lu, and Zhendong Su. 2022. Detecting Non-Crashing Functional Bugs in Android Apps via Deep-State Differential Analysis. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). Association for Computing ...
[48] Jue Wang, Yanyan Jiang, Chang Xu, Chun Cao, Xiaoxing Ma, and Jian Lu. 2020. ComboDroid: Generating High-Quality Test Inputs for Android Apps via Use Case Combinations. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20). Association for Computing Machinery, New York, NY, USA, 469–480. doi:10.1145/3377811.3380382
[49] Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. 2025. Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks. arXiv:2501.11733 [cs] doi:10.48550/arXiv.2501.11733
[50] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2024. AutoDroid: LLM-powered Task Automation in Android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking. ACM, ...
[51] Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2024. DroidBot-GPT: GPT-powered UI Automation for Android. arXiv:2304.07061 doi:10.48550/arXiv.2304.07061
[52] Yiheng Xiong, Ting Su, Jue Wang, Jingling Sun, Geguang Pu, and Zhendong Su. 2024. General and Practical Property-based Testing for Android Apps. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24). Association for Computing Machinery, New York, NY, USA, 53–64. doi:10.1145/3691620.3694986
[53] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A Survey of Large Language Models. arXiv:2303.18223 [cs] doi:10.48550/arXiv.2303.18223
[54] Yu Zhao, Tingting Yu, Ting Su, Yang Liu, Wei Zheng, Jingzhi Zhang, and William G. J. Halfond. 2019. ReCDroid: Automatically Reproducing Android Application Crashes from Bug Reports. In Proceedings of the 41st International Conference on Software Engineering (ICSE '19). IEEE Press, Montreal, Quebec, Canada, 128–139. doi:10.1109/ICSE.2019.00030
[55] Zhipu. 2025. Zhipu AI. https://www.zhipuai.cn/en