pith. machine review for the scientific record.

arxiv: 2604.02079 · v2 · submitted 2026-04-02 · 💻 cs.SE

Recognition: no theorem link

Automated Functional Testing for Malleable Mobile Application Driven from User Intent

Hao Deng, Jinxuan Zhou, Kaifeng Huang, Shengjie Zhao, Yuying Wang, Zhiyuan Sun

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:24 UTC · model grok-4.3

classification 💻 cs.SE
keywords: GUI testing · automated testing · LLM oracles · malleable software · mobile applications · user requirements · functional verification · test generation

The pith

ALADDIN generates automated GUI tests from user requests to verify LLM-implemented features in malleable mobile apps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALADDIN as a framework that turns user-specified requirements into executable tests for mobile applications that can be altered after deployment through LLM code generation. It focuses on the problem of confirming that requested functionalities are both present and correctly implemented without manual test writing for each change. The method incrementally explores the app interface to reach and exercise the new features while building oracles from LLMs to judge outcomes. A benchmark covering six popular applications with both working and broken versions of user requests serves as the evaluation, showing that the generated tests can distinguish the cases. This supports the broader possibility of moving mobile app customization from expert-driven processes to direct end-user control.

Core claim

ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. The framework is evaluated on a benchmark of six popular mobile applications containing both correct and faulty implementations of user-requested features, demonstrating that it can effectively validate per-user functionalities and is practical for real-world deployment.

What carries the argument

ALADDIN framework, which incrementally navigates the app UI to trigger user-requested actions and builds LLM-guided oracles to check functional correctness.
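
The abstract gives this loop only in outline. Below is a minimal sketch in Python of the navigate, trigger, judge cycle; every name in it (UIState, execute, is_trigger, pick_action, oracle) is a hypothetical stand-in for illustration, not an interface taken from the paper.

```python
from typing import Callable, List

class UIState:
    """Hypothetical snapshot of one GUI screen (name + available actions)."""
    def __init__(self, name: str, actions: List[str]):
        self.name, self.actions = name, actions

def validate_user_request(
    requirement: str,
    initial: UIState,
    execute: Callable[[UIState, str], UIState],   # perform one GUI action
    is_trigger: Callable[[str, UIState], bool],   # LLM: is the feature reachable here?
    pick_action: Callable[[str, UIState], str],   # LLM: most requirement-relevant action
    oracle: Callable[[str, UIState], str],        # LLM oracle: "correct" or "faulty"
    max_steps: int = 50,
) -> str:
    """Incrementally navigate toward the requested feature, trigger it,
    then let the LLM-guided oracle judge the outcome."""
    state = initial
    for _ in range(max_steps):
        if is_trigger(requirement, state):
            return oracle(requirement, state)
        state = execute(state, pick_action(requirement, state))
    return "not_triggered"  # budget exhausted without reaching the feature
```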

If this is right

  • User-requested functionalities can be automatically checked for presence and correctness in malleable mobile apps.
  • A shift from product-manager-driven to end-user-driven development becomes feasible.
  • The approach works across multiple real mobile applications without per-app customization.
  • LLM oracles suffice to separate working implementations from faulty ones in the tested scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same navigation-plus-oracle pattern could be adapted to verify dynamic features on web or desktop platforms.
  • App stores might incorporate similar testing to support on-demand user customizations.
  • Repeated use could lower the cost of maintaining personalized app variants over time.

Load-bearing premise

LLM-generated oracles can reliably distinguish correct from faulty user-requested functionalities without domain-specific tuning or human oversight for each new request.
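
To make the premise concrete, here is a minimal sketch of one LLM-guided oracle call. The prompt wording and the `chat` client are assumptions for illustration; the paper's actual prompt (Figure 4) is not reproduced here.

```python
from typing import Callable

# Illustrative prompt, not the paper's; any chat-completion client works as `chat`.
ORACLE_PROMPT = """You are a test oracle for a mobile app.
User requirement: {requirement}
UI view hierarchy observed after triggering the feature:
{view_tree}
Answer with exactly one word: CORRECT or FAULTY."""

def oracle_says_correct(chat: Callable[[str], str],
                        requirement: str, view_tree: str) -> bool:
    """Return True iff the LLM judges the implemented feature correct."""
    reply = chat(ORACLE_PROMPT.format(requirement=requirement, view_tree=view_tree))
    return reply.strip().upper().startswith("CORRECT")
```

The premise is that a judgment of this kind stays reliable across apps and requests without per-app tuning or human oversight.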

What would settle it

A set of user requests applied to the six benchmark apps in which the LLM oracles classify faulty implementations as correct; such misclassifications would refute the load-bearing premise.

Figures

Figures reproduced from arXiv: 2604.02079 by Hao Deng, Jinxuan Zhou, Kaifeng Huang, Shengjie Zhao, Yuying Wang, Zhiyuan Sun.

Figure 1. Motivation Scenario and a Conceptual Model for Per-User Malleable Application Development and Release. view at source ↗
Figure 2. Overview of Aladdin. view at source ↗
Figure 3. Illustrative Example. view at source ↗
Figure 4. Prompt for Trigger State Judgement. view at source ↗
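
The body text surrounding Figures 3 and 4 in the source describes two mechanisms: an EquivalentState check that treats the pair (base state n.s, operation n.op) as a unique identifier for visited states, and a state-rating update in which each candidate operation from PageExplorer receives an LLM relevance score Γ via computeScore and the tuple ⟨s′, op, score⟩ enters a priority queue. A minimal sketch of one exploration step, under assumed types:

```python
import heapq
from typing import Callable, List, Set, Tuple

def explore_step(
    frontier: List[Tuple[float, str, str]],   # heap of (-score, state_id, operation)
    history: Set[Tuple[str, str]],            # visited (state, operation) pairs
    candidate_ops: List[str],                 # operations found by the page explorer
    state_id: str,
    score: Callable[[str, str], float],       # LLM relevance of an op in a state
):
    """Score unvisited <state, operation> pairs, enqueue them, and pop the
    highest-rated pair as the next step to execute."""
    for op in candidate_ops:
        key = (state_id, op)
        if key in history:   # EquivalentState: this (s, op) pair was already tried
            continue
        history.add(key)
        # heapq is a min-heap, so negate the score to pop the best pair first
        heapq.heappush(frontier, (-score(state_id, op), state_id, op))
    return heapq.heappop(frontier) if frontier else None
```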
read the original abstract

Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose ALADDIN, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that ALADDIN effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes ALADDIN, a user-requirement-driven GUI test generation framework for malleable mobile applications. User intents are realized via LLM-based code generation; ALADDIN then performs incremental UI navigation, triggers the requested functionalities, and employs LLM-guided oracles to validate correctness. The central empirical contribution is a benchmark constructed across six popular mobile applications that includes both correct and faulty user-requested functionalities; the authors conclude that ALADDIN effectively validates per-user features and is practical for real-world deployment, thereby enabling a shift from product-manager-driven to end-user-driven mobile-app development.

Significance. If the benchmark results hold under independent verification, the work would constitute a meaningful advance in software engineering by addressing automated testing for dynamically generated, per-user functionalities in mobile apps. It extends prior research on configurable systems and malleable GUIs toward greater generalizability through LLM integration. The emphasis on end-user malleability is timely, but the practical impact depends on demonstrating that the LLM oracles provide reliable, non-circular validation.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: The processes of injecting faulty variants and validating them both rely on LLM-generated oracles from the same model family, with no mention of independent human-authored ground-truth labels, inter-annotator agreement, or an external oracle (e.g., manually written test cases). This introduces a circularity risk that directly undermines the central claim that ALADDIN correctly distinguishes correct from faulty functionalities.
  2. [Evaluation] Evaluation section: The abstract and evaluation report a demonstration on six applications but supply no quantitative results (success rates, precision/recall of oracle decisions, error analysis) or details on fault-injection methodology. Without these data it is impossible to assess whether the effectiveness claim holds.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by explicitly naming the quantitative metrics used to support the 'effectively validates' claim.
  2. [Framework description] Notation for the oracle construction process is introduced without a clear algorithmic listing or pseudocode, making the incremental navigation and oracle steps difficult to reproduce from the text alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript requires additional details on benchmark construction and quantitative evaluation results to strengthen the claims. We will revise accordingly and address each major comment below.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: The processes of injecting faulty variants and validating them both rely on LLM-generated oracles from the same model family, with no mention of independent human-authored ground-truth labels, inter-annotator agreement, or an external oracle (e.g., manually written test cases). This introduces a circularity risk that directly undermines the central claim that ALADDIN correctly distinguishes correct from faulty functionalities.

    Authors: We acknowledge the circularity concern as a valid point. While fault injection was performed by manually defining realistic faults drawn from common mobile app bug reports (independent of the LLM), the oracle validation step does rely on LLM guidance from the same family. We will revise the benchmark construction section to explicitly describe the manual fault injection process with examples, add a human validation study on a subset of cases (including inter-annotator agreement metrics), and report agreement between LLM oracles and human ground truth to demonstrate non-circular reliability. revision: yes

  2. Referee: [Evaluation] Evaluation section: The abstract and evaluation report a demonstration on six applications but supply no quantitative results (success rates, precision/recall of oracle decisions, error analysis) or details on fault-injection methodology. Without these data it is impossible to assess whether the effectiveness claim holds.

    Authors: We agree that the current version omits the necessary quantitative metrics and methodology details. The evaluation was performed on the six apps with both correct and faulty functionalities, but results were presented only qualitatively. We will expand the evaluation section to report concrete metrics including success rates for UI navigation and functionality triggering, precision/recall/F1 for oracle decisions on correct vs. faulty cases, and a categorized error analysis. We will also add a dedicated subsection detailing the fault-injection methodology with per-app examples. revision: yes
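
For readers weighing that promise, the requested metrics reduce to confusion-matrix arithmetic over the labeled correct/faulty benchmark variants. The sketch below is illustrative only; it is not the authors' evaluation code, and its inputs are hypothetical.

```python
from typing import List, Tuple

def oracle_metrics(verdicts: List[bool], labels: List[bool]) -> Tuple[float, float, float]:
    """verdicts[i]: oracle flagged variant i as faulty;
    labels[i]: variant i is actually faulty (ground truth)."""
    tp = sum(v and l for v, l in zip(verdicts, labels))        # faults caught
    fp = sum(v and not l for v, l in zip(verdicts, labels))    # correct code rejected
    fn = sum(l and not v for v, l in zip(verdicts, labels))    # faults missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```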

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no self-referential derivations

full rationale

The paper presents ALADDIN as an LLM-based framework for GUI test generation and oracle construction, evaluated via a manually constructed benchmark across six mobile apps containing both correct and faulty user-requested features. No equations, fitted parameters, or mathematical derivations appear. The central claim rests on empirical demonstration rather than reducing any quantity to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The benchmark creation process does not exhibit the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework depends on the unproven assumption that current LLMs can produce oracles accurate enough for functional correctness without per-app calibration.

axioms (1)
  • domain assumption: LLM-generated oracles can reliably validate the correctness of user-specified functionalities.
    Invoked when the framework constructs oracles to check triggered features.
invented entities (1)
  • ALADDIN framework (no independent evidence)
    purpose: automated test generation and validation for per-user malleable mobile apps
    New system proposed in the paper; no independent evidence outside the described benchmark.

pith-pipeline@v0.9.0 · 5489 in / 1227 out tokens · 39923 ms · 2026-05-13T21:24:43.867659+00:00 · methodology

discussion (0)

