pith. machine review for the scientific record.

arxiv: 2604.06373 · v1 · submitted 2026-04-07 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links · Lean Theorem

Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:21 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-generated code · design issues · Cursor IDE · large-scale projects · maintainability · FD-HITL framework · static analysis · software quality

The pith

Cursor with a feature-driven human-in-the-loop process generates functional projects averaging roughly 17,000 lines of code, yet these projects contain thousands of design issues that threaten long-term maintainability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the limits of current AI IDEs when asked to produce complete, large-scale software systems rather than small snippets. By introducing the FD-HITL framework to steer Cursor through curated feature lists, the authors produced ten working projects that average 16,965 lines of code and 114 files with 91 percent functional correctness. Static analysis with CodeScene and SonarQube then revealed 1,305 and 3,193 design issues respectively, dominated by duplication, high complexity, large methods, and violations of single-responsibility and separation-of-concerns principles. The central observation is that functional success alone does not deliver sustainable architecture, so experienced developers must still perform careful review.
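
By way of scale, the reported totals imply roughly 26 static-analysis flags per thousand lines of generated code. A minimal back-of-envelope sketch using only numbers stated above; note that the two tools overlap (Figure 19), so the sum double-counts some findings:

```python
# Back-of-envelope issue density from the paper's reported totals.
projects = 10
avg_loc = 16_965            # average LoC per generated project
codescene_issues = 1_305    # design issues flagged by CodeScene
sonarqube_issues = 3_193    # issues flagged by SonarQube

total_loc = projects * avg_loc                     # ~169,650 LoC
total_flags = codescene_issues + sonarqube_issues  # 4,498 flags
# Caveat: the tools overlap (Figure 19), so this sum double-counts.

print(f"{total_flags / (total_loc / 1_000):.1f} flags per KLoC")  # ~26.5
```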

Core claim

When used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers. The most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues, and Accessibility Issues; these design issues violate design principles such as single responsibility (SRP), separation of concerns (SoC), and don't-repeat-yourself (DRY).
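
To make the principle labels concrete, here is a generic illustration, not taken from the paper's generated projects, of the kind of duplication a DRY check flags and the small refactor that removes it:

```python
# Hypothetical example (not from the paper's projects) of a DRY
# violation of the kind duplication detectors flag.

# Before: the same validation logic is repeated in two handlers.
def create_user(payload: dict) -> dict:
    if "@" not in payload.get("email", ""):
        raise ValueError("invalid email")
    ...

def update_user(payload: dict) -> dict:
    if "@" not in payload.get("email", ""):
        raise ValueError("invalid email")
    ...

# After: one focused helper; this is also closer to SRP, since
# validation now lives in a single unit instead of every handler.
def require_valid_email(payload: dict) -> str:
    email = payload.get("email", "")
    if "@" not in email:
        raise ValueError("invalid email")
    return email
```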

What carries the argument

The Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated descriptions, paired with CodeScene and SonarQube static analysis to surface design issues.
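
As a sketch of how the SonarQube half of such a pipeline can be tallied, assuming a local SonarQube server, placeholder credentials, and a hypothetical project key (none of these reflect the authors' actual setup), the standard /api/issues/search endpoint can be queried and grouped by rule:

```python
# Sketch: tally SonarQube findings by rule for one generated project.
# Server URL, credentials, and project key below are assumptions.
from collections import Counter

import requests

SONAR_URL = "http://localhost:9000"   # assumed local instance
PROJECT_KEY = "P3_Ecommerce"          # hypothetical project key

resp = requests.get(
    f"{SONAR_URL}/api/issues/search",
    # "ps" (page size) caps at 500; real runs would paginate with "p".
    params={"componentKeys": PROJECT_KEY, "ps": 500},
    auth=("admin", "admin"),          # placeholder credentials
)
resp.raise_for_status()

rule_counts = Counter(issue["rule"] for issue in resp.json()["issues"])
for rule, n in rule_counts.most_common(10):
    print(f"{n:5d}  {rule}")
```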

Load-bearing premise

The design issues reported by CodeScene and SonarQube accurately predict real maintainability and evolvability problems that would appear when the generated projects are maintained or extended in practice.

What would settle it

A controlled maintenance experiment that measures the time and defect rates required to add new features to one of the AI-generated projects versus an equivalent human-written codebase over several iterations.
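
A minimal sketch of how that experiment's headline comparison could be analyzed, with invented placeholder effort numbers and a Mann-Whitney U test standing in for whatever analysis such a study would actually preregister:

```python
# Sketch of the settling experiment's headline comparison: effort to
# add the same features to an AI-generated vs. a human-written
# codebase. All numbers are invented placeholders, not measurements.
from scipy.stats import mannwhitneyu

hours_ai_generated = [6.5, 9.0, 7.2, 11.4, 8.1]   # hypothetical
hours_human_written = [5.0, 6.8, 5.9, 7.5, 6.2]   # hypothetical

stat, p = mannwhitneyu(hours_ai_generated, hours_human_written,
                       alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.3f}")
```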

Figures

Figures reproduced from arXiv: 2604.06373 by Amjed Tahir, Mojtaba Shahin, Peng Liang, Qiong Feng, Ruiyin Li, Syed Mohammad Kashif, Zengyang Li.

Figure 1. Overview of our research process.
Figure 2. Overview of the Feature-Driven Human-In-The-Loop (FD-HITL) framework.
Figure 3. An example of a mobile application (P8_SocialApp) generated using Cursor.
Figure 4. An example of a web application (P7_BlogWebsite) generated using Cursor.
Figure 5. Overview of the categories of the design issues identified by CodeScene.
Figure 6. An example of a Code Duplication design issue.
Figure 7. Distribution of cyclomatic complexity (CC).
Figure 8. Examples of the Complex Conditional design issue.
Figure 9. An example of the Bumpy Road Ahead design issue.
Figure 10. An example of the Excess Number of Function Arguments and Primitive Obsession design issues.
Figure 11. Overview of the design issues identified by SonarQube.
Figure 12. An example of the Duplicated CSS Selectors design issue.
Figure 13. An example of the Generic Exceptions design issue.
Figure 14. An example of the Complex Ternary Operators design issue.
Figure 15. An example of the Higher Number of Constructor Parameters design issue.
Figure 16. An example of the Non-Native Interactive Element design issue.
Figure 18. An example of the Merging Conditional design issue.
Figure 19. Overview of the overlap between issues identified by SonarQube and CodeScene.
Original abstract

A new generation of AI coding tools, including AI-powered IDEs equipped with agentic capabilities, can generate code within the context of a project. These AI IDEs are increasingly perceived as capable of producing project-level code at scale. However, there is limited empirical evidence on the extent to which they can generate large-scale software systems and what design issues such systems may exhibit. To address this gap, we conducted a study to explore the capability of Cursor in generating large-scale projects and to evaluate the design quality of the projects it generates. First, we propose a Feature-Driven Human-In-The-Loop (FD-HITL) framework that systematically guides project generation from curated project descriptions. We generated 10 projects using Cursor with the FD-HITL framework across three application domains and multiple technologies. We assessed the functional correctness of these projects through manual evaluation, obtaining an average functional correctness score of 91%. Next, we analyzed the generated projects using two static analysis tools, CodeScene and SonarQube, to detect design issues. We identified 1,305 design issues in 9 categories by CodeScene and 3,193 issues in 11 categories by SonarQube. Our findings show that (1) when used with the FD-HITL framework, Cursor can generate functional large-scale projects averaging 16,965 LoC and 114 files; (2) the generated projects nevertheless contain design issues that may pose long-term maintainability and evolvability risks, requiring careful review by experienced developers; (3) the most prevalent issues include Code Duplication, high Code Complexity, Large Methods, Framework Best-Practice Violations, Exception-Handling Issues, and Accessibility Issues; (4) these design issues violate design principles such as SRP, SoC, and DRY. The replication package is at https://github.com/Kashifraz/DIinAGP

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a Feature-Driven Human-In-The-Loop (FD-HITL) framework to guide the Cursor AI IDE in generating large-scale projects from curated descriptions. It reports results from 10 projects (averaging 16,965 LoC and 114 files across three domains) that achieve 91% functional correctness by manual evaluation. Static analysis via CodeScene (1,305 issues in 9 categories) and SonarQube (3,193 issues in 11 categories) identifies prevalent problems including code duplication, high complexity, large methods, framework best-practice violations, exception-handling issues, and accessibility issues. These are mapped to violations of SRP, SoC, and DRY, leading to the conclusion that the projects contain design issues posing long-term maintainability and evolvability risks and thus require careful review by experienced developers. A replication package is provided.

Significance. If the link between static-analysis flags and real maintainability risks holds, the work provides valuable empirical evidence on the current limits of agentic AI IDEs for producing production-scale code. The FD-HITL framework, concrete project-scale metrics, and open replication package are strengths that could inform both practitioners using these tools and researchers studying AI-assisted software engineering. The shift from functional correctness alone to design-quality assessment addresses a timely gap.

major comments (3)
  1. [Abstract] Abstract, finding (2): the claim that the identified design issues 'may pose long-term maintainability and evolvability risks' rests entirely on heuristic outputs from CodeScene and SonarQube without any direct validation. No expert review, maintenance-task experiments, defect-rate measurements, or comparison against human-written equivalents is reported to show that the 1,305 + 3,193 detections would actually increase costs or defects in practice, especially for LLM-generated structures that may trigger false positives in rule-based tools.
  2. [Methods] Methods section (project selection and manual evaluation): the description of how the 10 projects and three domains were chosen is insufficiently detailed, and the 91% functional-correctness score lacks reporting of inter-rater reliability, number of evaluators, or the concrete test cases and acceptance criteria used. These omissions weaken confidence in both the scale and correctness claims.
  3. [Results] Results (design-issue categorization): the mapping of tool detections to SRP, SoC, and DRY violations is stated without per-category examples, code snippets, or quantitative justification showing how specific CodeScene/SonarQube rules correspond to these principles in the generated artifacts. This makes the principle-violation claim difficult to assess.
minor comments (2)
  1. [Abstract] Abstract: the three application domains and the specific technologies used are mentioned but not named; adding this information would improve readability.
  2. [Replication package] The replication package URL is given but should be accompanied by a brief description of its contents (e.g., project prompts, generated code, tool reports) to aid reviewers and readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps us improve the clarity and rigor of our work. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract, finding (2): the claim that the identified design issues 'may pose long-term maintainability and evolvability risks' rests entirely on heuristic outputs from CodeScene and SonarQube without any direct validation. No expert review, maintenance-task experiments, defect-rate measurements, or comparison against human-written equivalents is reported to show that the 1,305 + 3,193 detections would actually increase costs or defects in practice, especially for LLM-generated structures that may trigger false positives in rule-based tools.

    Authors: We agree that the maintainability-risk claim is grounded in outputs from established static-analysis tools rather than direct validation experiments. CodeScene and SonarQube are widely adopted in both industry and research, with prior studies demonstrating correlations between their metrics and real maintenance effort; however, we acknowledge that LLM-generated code may introduce unique false-positive risks not fully explored here. To address this, we will revise the abstract to qualify the statement (e.g., 'may indicate potential long-term risks according to static-analysis heuristics') and add an explicit limitations subsection discussing tool applicability to AI-generated artifacts and the absence of direct maintenance-task validation, while noting this as future work. This keeps the contribution focused on applying these tools to AI IDE output while avoiding overstatement. revision: partial

  2. Referee: [Methods] Methods section (project selection and manual evaluation): the description of how the 10 projects and three domains were chosen is insufficiently detailed, and the 91% functional-correctness score lacks reporting of inter-rater reliability, number of evaluators, or the concrete test cases and acceptance criteria used. These omissions weaken confidence in both the scale and correctness claims.

    Authors: We thank the referee for identifying these reporting gaps. In the revised manuscript we will expand the Methods section with: (1) explicit selection criteria for the 10 projects and three domains, including the rationale for domain diversity and technology choices to support representativeness; (2) details that functional evaluation was performed by two independent evaluators with a third resolving conflicts; (3) the computed inter-rater reliability (Cohen's kappa; see the sketch after these responses); and (4) representative test cases and acceptance criteria per project. These additions will strengthen reproducibility and reader confidence in the scale and correctness results. revision: yes

  3. Referee: [Results] Results (design-issue categorization): the mapping of tool detections to SRP, SoC, and DRY violations is stated without per-category examples, code snippets, or quantitative justification showing how specific CodeScene/SonarQube rules correspond to these principles in the generated artifacts. This makes the principle-violation claim difficult to assess.

    Authors: We accept that the current mapping lacks sufficient illustrative support. In the revision we will augment the Results section by adding: concrete per-category examples (e.g., a duplicated code fragment violating DRY), anonymized code snippets drawn from the generated projects, and a justification table that links specific tool rules (such as SonarQube cognitive complexity or CodeScene duplication metrics) to the relevant principles (SRP, SoC, DRY) with quantitative counts of how many issues fall under each mapping. This will make the principle-violation analysis transparent and easier to evaluate. revision: yes
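
As a sketch of the agreement computation promised in response 2, assuming two evaluators' per-requirement pass/fail verdicts (the labels below are invented for illustration, not the authors' data):

```python
# Sketch: inter-rater agreement over per-requirement pass/fail
# verdicts from two evaluators. Verdicts are invented for
# illustration; 1 = requirement judged complete, 0 = not.
from sklearn.metrics import cohen_kappa_score

evaluator_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
evaluator_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

print(f"Cohen's kappa = {cohen_kappa_score(evaluator_a, evaluator_b):.2f}")
```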

Circularity Check

0 steps flagged

No significant circularity in empirical analysis of generated projects

Full rationale

This is a direct empirical study: the authors define an FD-HITL framework, use it to generate 10 projects via Cursor, manually score functional correctness at 91%, and then apply independent external static-analysis tools (CodeScene, SonarQube) to count design-issue detections. No equations, fitted parameters, or predictions are defined in terms of the target claims; the mapping of tool flags to SRP/SoC/DRY is interpretive but not self-referential. The central results (LoC counts, issue tallies, prevalence rankings) are measurements of generated artifacts rather than derivations that collapse to the paper's own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked to force the conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that static analysis tools provide valid indicators of maintainability risks and that the ten generated projects generalize to real large-scale development.

axioms (2)
  • domain assumption Static analysis tools CodeScene and SonarQube accurately detect design issues that matter for long-term maintainability and evolvability.
    The study uses these tools to categorize 4,498 issues; no additional human validation of their practical impact is mentioned in the abstract.
  • domain assumption The ten projects generated via FD-HITL are representative of large-scale software systems.
    Generalization from a small curated set of domains and technologies.

pith-pipeline@v0.9.0 · 5665 in / 1392 out tokens · 42292 ms · 2026-05-10T18:21:33.039117+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
