Recognition: no theorem link
Automated Functional Testing for Malleable Mobile Application Driven from User Intent
Pith reviewed 2026-05-13 21:24 UTC · model grok-4.3
The pith
ALADDIN generates automated GUI tests from user requests to verify LLM-implemented features in malleable mobile apps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ALADDIN is a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. The framework is evaluated on a benchmark of six popular mobile applications containing both correct and faulty implementations of user-requested features, demonstrating that it can effectively validate per-user functionalities and is practical for real-world deployment.
What carries the argument
The ALADDIN framework, which incrementally navigates the app UI to trigger user-requested actions and builds LLM-guided oracles to check functional correctness.
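A minimal sketch of what such a navigation-plus-oracle loop might look like. The `FakeUI` and `FakeLLM` stubs, and all function names, are illustrative assumptions for exposition, not the authors' API:

```python
# Sketch of an ALADDIN-style loop: incrementally navigate toward the
# requested functionality, then ask an LLM-guided oracle for a verdict.
# All names here are hypothetical; the paper's actual interfaces may differ.

def navigate_and_check(ui, llm, requirement, max_steps=10):
    """Drive the UI toward `requirement`, then validate with an LLM oracle."""
    for _ in range(max_steps):
        state = ui.current_state()                    # e.g. serialized view hierarchy
        action = llm.next_action(requirement, state)  # LLM picks the next UI action
        if action is None:                            # target screen reached
            break
        ui.perform(action)
    return llm.oracle(requirement, ui.current_state())  # "correct" or "faulty"

class FakeUI:
    """Toy stand-in for a device driver: a fixed sequence of screens."""
    def __init__(self, screens):
        self.screens, self.pos = screens, 0
    def current_state(self):
        return self.screens[self.pos]
    def perform(self, action):
        self.pos += 1

class FakeLLM:
    """Toy stand-in for the LLM: navigate until the feature screen appears."""
    def next_action(self, requirement, state):
        return None if state == "feature_screen" else "tap_next"
    def oracle(self, requirement, state):
        return "correct" if state == "feature_screen" else "faulty"

verdict = navigate_and_check(
    FakeUI(["home", "settings", "feature_screen"]),
    FakeLLM(),
    "add a dark-mode toggle",
)  # -> "correct"
```

The stubs exist only to make the control flow concrete; a real driver would wrap something like uiautomator2 [14] and a real LLM client.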
If this is right
- User-requested functionalities can be automatically checked for presence and correctness in malleable mobile apps.
- A shift from product-manager-driven to end-user-driven development becomes feasible.
- The approach works across multiple real mobile applications without per-app customization.
- LLM oracles suffice to separate working implementations from faulty ones in the tested scenarios.
Where Pith is reading between the lines
- The same navigation-plus-oracle pattern could be adapted to verify dynamic features on web or desktop platforms.
- App stores might incorporate similar testing to support on-demand user customizations.
- Repeated use could lower the cost of maintaining personalized app variants over time.
Load-bearing premise
LLM-generated oracles can reliably distinguish correct from faulty user-requested functionalities without domain-specific tuning or human oversight for each new request.
What would settle it
A set of user requests applied to the six benchmark apps for which the LLM oracles classify faulty implementations as correct; such misclassifications would falsify the load-bearing premise.
Original abstract
Software malleability allows applications to be easily changed, configured, and adapted even after deployment. While prior work has explored configurable systems, adaptive recommender systems, and malleable GUIs, these approaches are often tailored to specific software and lack generalizability. In this work, we envision per-user malleable mobile applications, where end-users can specify requirements that are automatically implemented via LLM-based code generation. However, realizing this vision requires overcoming the key challenge of designing automated test generation that can reliably verify both the presence and correctness of user-specified functionalities. We propose ALADDIN, a user-requirement-driven GUI test generation framework that incrementally navigates the UI, triggers desired functionalities, and constructs LLM-guided oracles to validate correctness. We build a benchmark spanning six popular mobile applications with both correct and faulty user-requested functionalities, demonstrating that ALADDIN effectively validates per-user features and is practical for real-world deployment. Our work highlights the feasibility of shifting mobile app development from a product-manager-driven to an end-user-driven paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ALADDIN, a user-requirement-driven GUI test generation framework for malleable mobile applications. User intents are realized via LLM-based code generation; ALADDIN then performs incremental UI navigation, triggers the requested functionalities, and employs LLM-guided oracles to validate correctness. The central empirical contribution is a benchmark constructed across six popular mobile applications that includes both correct and faulty user-requested functionalities; the authors conclude that ALADDIN effectively validates per-user features and is practical for real-world deployment, thereby enabling a shift from product-manager-driven to end-user-driven mobile-app development.
Significance. If the benchmark results hold under independent verification, the work would constitute a meaningful advance in software engineering by addressing automated testing for dynamically generated, per-user functionalities in mobile apps. It extends prior research on configurable systems and malleable GUIs toward greater generalizability through LLM integration. The emphasis on end-user malleability is timely, but the practical impact depends on demonstrating that the LLM oracles provide reliable, non-circular validation.
major comments (2)
- [Benchmark construction] Benchmark construction section: The procedures for injecting faulty variants and for validating them both rely on LLM-generated oracles from the same model family, with no mention of independent human-authored ground-truth labels, inter-annotator agreement, or an external oracle (e.g., manually written test cases). This introduces a circularity risk that directly undermines the central claim that ALADDIN correctly distinguishes correct from faulty functionalities.
- [Evaluation] Evaluation section: The abstract and evaluation report a demonstration on six applications but supply no quantitative results (success rates, precision/recall of oracle decisions, error analysis) or details on fault-injection methodology. Without these data it is impossible to assess whether the effectiveness claim holds.
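If the human validation study suggested above were added, inter-annotator agreement between human ground truth and LLM oracle verdicts could be reported with Cohen's kappa. A minimal sketch; the labels below are illustrative, not data from the paper:

```python
# Cohen's kappa between two binary label sequences
# (1 = implementation judged correct, 0 = judged faulty).

def cohens_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n            # marginal "correct" rates
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)       # agreement expected by chance
    return (po - pe) / (1 - pe)

human = [1, 1, 0, 0, 1]       # hypothetical human ground truth
llm_oracle = [1, 0, 0, 0, 1]  # hypothetical LLM oracle verdicts
print(round(cohens_kappa(human, llm_oracle), 4))  # 0.6154
```

Reporting kappa alongside raw agreement would directly address the circularity concern, since chance-corrected agreement with an independent annotator cannot be inflated by the oracle agreeing with itself.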
minor comments (2)
- [Abstract] The abstract would be strengthened by explicitly naming the quantitative metrics used to support the 'effectively validates' claim.
- [Framework description] Notation for the oracle construction process is introduced without a clear algorithmic listing or pseudocode, making the incremental navigation and oracle steps difficult to reproduce from the text alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the manuscript requires additional details on benchmark construction and quantitative evaluation results to strengthen the claims. We will revise accordingly and address each major comment below.
Point-by-point responses
Referee: [Benchmark construction] Benchmark construction section: The procedures for injecting faulty variants and for validating them both rely on LLM-generated oracles from the same model family, with no mention of independent human-authored ground-truth labels, inter-annotator agreement, or an external oracle (e.g., manually written test cases). This introduces a circularity risk that directly undermines the central claim that ALADDIN correctly distinguishes correct from faulty functionalities.
Authors: We acknowledge the circularity concern as a valid point. While fault injection was performed by manually defining realistic faults drawn from common mobile app bug reports (independent of the LLM), the oracle validation step does rely on LLM guidance from the same family. We will revise the benchmark construction section to explicitly describe the manual fault injection process with examples, add a human validation study on a subset of cases (including inter-annotator agreement metrics), and report agreement between LLM oracles and human ground truth to demonstrate non-circular reliability. revision: yes
Referee: [Evaluation] Evaluation section: The abstract and evaluation report a demonstration on six applications but supply no quantitative results (success rates, precision/recall of oracle decisions, error analysis) or details on fault-injection methodology. Without these data it is impossible to assess whether the effectiveness claim holds.
Authors: We agree that the current version omits the necessary quantitative metrics and methodology details. The evaluation was performed on the six apps with both correct and faulty functionalities, but results were presented only qualitatively. We will expand the evaluation section to report concrete metrics including success rates for UI navigation and functionality triggering, precision/recall/F1 for oracle decisions on correct vs. faulty cases, and a categorized error analysis. We will also add a dedicated subsection detailing the fault-injection methodology with per-app examples. revision: yes
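The oracle-decision metrics promised in this response could be computed as follows, treating "faulty" as the positive class. The data here is illustrative, not the paper's results:

```python
# Precision, recall, and F1 for oracle decisions, with "faulty" as the
# positive class (i.e., the oracle's job is to catch faulty implementations).

def prf(y_true, y_pred, positive="faulty"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical ground truth vs. oracle verdicts across benchmark variants.
y_true = ["faulty", "faulty", "correct", "correct", "faulty", "correct"]
y_pred = ["faulty", "correct", "correct", "faulty", "faulty", "correct"]
precision, recall, f1 = prf(y_true, y_pred)  # each 2/3 on this toy data
```

Reporting these per app, plus a categorized error analysis of the false negatives (faulty variants the oracle passed), would make the effectiveness claim checkable.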
Circularity Check
No circularity: empirical benchmark with no self-referential derivations
Full rationale
The paper presents ALADDIN as an LLM-based framework for GUI test generation and oracle construction, evaluated via a manually constructed benchmark across six mobile apps containing both correct and faulty user-requested features. No equations, fitted parameters, or mathematical derivations appear. The central claim rests on empirical demonstration rather than reducing any quantity to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The benchmark creation process does not exhibit the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-generated oracles can reliably validate correctness of user-specified functionalities.
invented entities (1)
- ALADDIN framework (no independent evidence)
Reference graph
Works this paper leans on
- [1] 2025. Anonymous Replication Package. http://zenodo.org/records/19250597. Accessed March 23, 2026.
- [2] Ons Al-Shamaileh and Alistair Sutcliffe. 2023. Why people choose Apps: An evaluation of the ecology and user experience of mobile applications. International Journal of Human-Computer Studies 170 (2023), 102965.
- [3] apple. 2025. Malleable Systems Collective. https://appcircle.io/guides/ios/ios-releases. Accessed: 2026-02-22.
- [4] William Ashby. 2013. Design for a brain: The origin of adaptive behaviour. Springer Science & Business Media.
- [5] Timothy J Aveni, Hila Mor, Armando Fox, and Björn Hartmann. 2025. Generative Trigger-Action Programming with Ply. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–17.
- [6] Kesina Baral, John Johnson, Junayed Mahmud, Sabiha Salma, Mattia Fazzini, Julia Rubin, Jeff Offutt, and Kevin Moran. 2024. Automating gui-based test oracles for mobile apps. In Proceedings of the 21st International Conference on Mining Software Repositories. 309–321.
- [7] boagworld. 2025. A Non-Developer's Experience Vibe Coding. https://boagworld.com/dev/a-non-developers-experience-vibe-coding/. Accessed: 2026-02-22.
- [8] Yining Cao, Peiling Jiang, and Haijun Xia. 2025. Generative and malleable user interfaces with generative and evolving task-driven data model. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI). 1–20.
- [9] Meng Chen and Amy Pavel. 2025. TaskArtisan: Flexible Authoring and Manipulation of Task-specific Interactive Widgets via Sketch and Voice. In Adjunct Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 1–3.
- [10] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian... arXiv, 2021.
- [11] Yinghao Chen, Zehao Hu, Chen Zhi, Junxiao Han, Shuiguang Deng, and Jianwei Yin. 2024. Chatunitest: A framework for llm-based test generation. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 572–576.
- [12] Wei Cheng, Yuhan Wu, and Wei Hu. 2024. Dataflow-guided retrieval augmentation for repository-level code completion. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7957–7977.
- [13] Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, and Taesoo Kim. 2025. Rug: Turbo llm for rust unit test generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 634–634.
- [14] OpenATX contributors. 2020. uiautomator2: Python library for Android UI automation. https://github.com/openatx/uiautomator2.
- [15] Yangruibo Ding, Zijian Wang, Wasi Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, et al. 2023. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. Advances in Neural Information Processing Systems 36 (2023), 46701–46723.
- [16] Jinhao Dong, Jun Sun, Wenjie Zhang, Jin Song Dong, and Dan Hao. 2025. Contested: Consistency-aided tested code generation with llm. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 596–617.
- [17] Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K Lahiri. 2024. Can large language models transform natural language intent into formal method postconditions? Proceedings of the ACM on Software Engineering 1, FSE (2024), 1889–1912.
- [18] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. 2024. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Transactions on Software Engineering (TSE) 50, 9 (2024), 2254–2268.
- [19] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Madan Musuvathi, and Shuvendu Lahiri. 2024. Exploring the effectiveness of llm based test-driven interactive code generation: User study and empirical evaluation. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 390–391.
- [20] Brian Foote and Ralph E Johnson. 1989. Reflective facilities in Smalltalk-80. ACM Sigplan Notices 24, 10 (1989), 327–335.
- [21] Camille Gobert and Michel Beaudouin-Lafon. 2023. Lorgnette: Creating Malleable Code Projections. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). 1–16.
- [22]
- [23] Giovanni Grano, Andrea Di Sorbo, Francesco Mercaldo, Corrado A Visaggio, Gerardo Canfora, and Sebastiano Panichella. 2017. Android apps and user feedback: a dataset for software evolution and quality improvement. In Proceedings of the 2nd ACM SIGSOFT international workshop on app market analytics. 8–11.
- [24] Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. 2025. On the effectiveness of large language models in domain-specific code generation. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–22.
- [25] Dan Han, Chenlei Zhang, Xiaochao Fan, Abram Hindle, Kenny Wong, and Eleni Stroulia. 2012. Understanding android fragmentation with topic analysis of vendor-specific bugs. In 2012 19th Working Conference on Reverse Engineering. IEEE, 83–92.
- [26] Hojae Han, Jaejin Kim, Jaeseok Yoo, Youngwon Lee, and Seung-won Hwang. 2024. Archcode: Incorporating software requirements in code generation with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13520–13552.
- [27] Mark Harman, Peter O'Hearn, and Shubho Sengupta. 2025. Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 1–17.
- [28] Ishrak Hayet, Adam Scott, and Marcelo d'Amorim. 2024. Chatassert: Llm-based test oracle generation with external tools assistance. IEEE Transactions on Software Engineering 51, 1 (2024), 305–319.
- [29] Soneya Binta Hossain and Matthew B Dwyer. 2025. Togll: Correct and strong test oracle generation with llms. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1475–1487.
- [30] Soneya Binta Hossain, Raygan Taylor, and Matthew Dwyer. 2025. Doc2oracll: Investigating the impact of documentation on llm-based test oracle generation. Proceedings of the ACM on Software Engineering 2, FSE (2025), 1870–1891.
- [31] Jack Johnson, Junayed Mahmud, Oscar Chaparro, Kevin Moran, and Mattia Fazzini. 2025. Generating Failure-Based Oracles to Support Testing of Reported Bugs in Android Apps. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2196–2208.
- [32] Rahul Kande, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Shailja Thakur, Ramesh Karri, and Jeyavijayan Rajendran. 2024. (Security) assertions by large language models. IEEE Transactions on Information Forensics and Security 19 (2024), 4374–4389.
- [33] Jeffrey O Kephart and David M Chess. 2003. The vision of autonomic computing. Computer 36, 1 (2003), 41–50.
- [34] Shaker Mahmud Khandaker, Fitsum Kifetew, Davide Prandi, and Angelo Susi. 2025. AugmenTest: Enhancing tests with LLM-driven oracles. In 2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 279–289.
- [35] Lava18. 2021. Google Play Store Apps Dataset. https://www.kaggle.com/datasets/lava18/google-play-store-apps. Accessed: 2025-03-26.
- [36] Wei Liu, Ailun Yu, Daoguang Zan, Bo Shen, Wei Zhang, Haiyan Zhao, Zhi Jin, and Qianxiang Wang. 2024. Graphcoder: Enhancing repository-level code completion via coarse-to-fine retrieval based on code context graph. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 570–581.
- [37] Qianou Ma, Weirui Peng, Chenyang Yang, Hua Shen, Ken Koedinger, and Tongshuang Wu. 2025. What should we engineer in prompts? Training humans in requirement-driven llm use. ACM Transactions on Computer-Human Interaction 32, 4 (2025), 1–27.
- [38] malleable.systems. 2025. Malleable Systems Collective. https://malleable.systems/. Accessed: 2026-02-22.
- [39] Iker Martín Álvarez, José I Aliaga, Maribel Castillo, and Sergio Iserte. 2023. Efficient data redistribution for malleable applications. In Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis. 416–426.
- [40] Iker Martín-Álvarez, José I Aliaga, Maribel Castillo, and Sergio Iserte. 2024. MaM: A User-Friendly Interface to Incorporate Malleability Into MPI Applications. In European Conference on Parallel Processing. Springer, 346–358.
- [41] Bryan Min, Allen Chen, Yining Cao, and Haijun Xia. 2025. Malleable overview-detail interfaces. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI). 1–25.
- [42] Facundo Molina, Alessandra Gorla, and Marcelo d'Amorim. 2025. Test Oracle Automation in the era of LLMs. ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–24.
- [43] Davide Molinelli, Luca Di Grazia, Alberto Martin-Lopez, Michael D Ernst, and Mauro Pezze. 2025. Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 278–290.
- [44] Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. Clarifygpt: A framework for enhancing llm-based code generation via requirements clarification. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2332–2354.
- [45] Zifan Nan, Zhaoqiang Guo, Kui Liu, and Xin Xia. 2025. Test intention guided LLM-based unit test generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1026–1038.
- [46] Vaishnavi Pulavarthi, Deeksha Nandal, Soham Dan, and Debjit Pal. 2025. Are LLMs Ready for Practical Adoption for Assertion Generation? In 2025 Design, Automation & Test in Europe Conference (DATE). IEEE, 1–7.
- [47] Vaishnavi Pulavarthi, Deeksha Nandal, Soham Dan, and Debjit Pal. 2025. Assertionbench: A benchmark to evaluate large-language models for assertion generation. In Findings of the Association for Computational Linguistics: NAACL 2025. 8058–8065.
- [48] Shanto Rahman, Sachit Kuhar, Berk Cirisci, Pranav Garg, Shiqi Wang, Xiaofei Ma, Anoop Deoras, and Baishakhi Ray. 2025. UTFix: Change aware unit test repairing using LLM. Proceedings of the ACM on Programming Languages 9, OOPSLA1 (2025), 143–168.
- [49] Partha Pratim Ray. 2025. A review on vibe coding: Fundamentals, state-of-the-art, challenges and future directions. Authorea Preprints (2025).
- [50] Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. 2024. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm. Proceedings of the ACM on Software Engineering 1, FSE (2024), 951–971.
- [51] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering 50, 1 (2023), 85–105.
- [52] Jiho Shin, Hadi Hemmati, Moshi Wei, and Song Wang. 2024. Assessing evaluation metrics for neural test oracle generation. IEEE Transactions on Software Engineering 50, 9 (2024), 2337–2349.
- [53] Yanlin Wang, Yanli Wang, Daya Guo, Jiachi Chen, Ruikai Zhang, Yuchi Ma, and Zibin Zheng. 2025. Rlcoder: Reinforcement learning for repository-level code completion. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1140–1152.
- [54] Yonghao Wang, Jiaxin Zhou, Yang Yin, Hongqin Lyu, Zhiteng Chao, Wenchao Ding, Jing Ye, Tiancheng Wang, and Huawei Li. 2026. Iterative LLM-Based Assertion Generation Using Syntax-Semantic Representations for Functional Coverage-Guided Verification. arXiv preprint arXiv:2602.15388 (2026).
- [55] Zejun Wang, Kaibo Liu, Ge Li, and Zhi Jin. 2024. Hits: High-coverage llm-based unit test generation via method slicing. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1258–1268.
- [56]
- [57] Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, et al. 2025. Repomastereval: Evaluating code completion via real-world repositories. In 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3672–3683.
- [58] JD Zamfirescu-Pereira, Eunice Jun, Michael Terry, Qian Yang, and Bjoern Hartmann. 2025. Beyond code generation: Llm-supported exploration of the program design space. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–17.
- [59] Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. 2023. Repocoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2471–2484.
- [60] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13643–13658.
- [61] Lei Zhang, Yunshui Li, Jiaming Li, Xiaobo Xia, Jiaxi Yang, Run Luo, Minzheng Wang, Longze Chen, Junhao Liu, Qiang Qu, et al. 2025. Hierarchical context pruning: Optimizing real-world code completion with repository-level pretrained code llms. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 25886–25894.
- [62] Quanjun Zhang, Weifeng Sun, Chunrong Fang, Bowen Yu, Hongyan Li, Meng Yan, Jianyi Zhou, and Zhenyu Chen. 2025. Exploring automated assertion generation via large language models. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–25.
- [63] Yating Zhang, Wei Dong, Jiaxin Liu, Shangwen Wang, Deze Wang, Tiecheng Ma, Yiwei Li, and Kang Yang. 2025. A Little Help Goes a Long Way: Tutoring LLMs in Solving Competitive Programming through Hints. IEEE Transactions on Software Engineering (2025).
- [64]
- [65] Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503.
- [66] Jie Zhou, Youshu Ji, Ning Wang, Yuchen Hu, Xinyao Jiao, Bingkun Yao, Xinwei Fang, Shuai Zhao, Nan Guan, and Zhe Jiang. 2025. Insights from rights and wrongs: A large language model for solving assertion failures in rtl design. In 2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 1–7.