arxiv: 2505.13766 · v5 · submitted 2025-05-19 · cs.SE · cs.AI · cs.CL

A Blueprint for AI-Driven Software Quality: Integrating LLMs with Established Standards

Pith reviewed 2026-05-22 13:41 UTC · model grok-4.3

classification cs.SE · cs.AI · cs.CL
keywords large language models, software quality assurance, ISO/IEC standards, CMMI, process maturity, AI-driven quality, compliance mapping, test generation

The pith

Large language models can perform software quality assurance tasks while aligning with standards such as ISO/IEC 12207, ISO 9001, and CMMI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys the use of large language models to support software quality assurance processes including requirement validation, code review, test generation, and compliance verification. It reviews foundational standards like ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001, CMMI, and TMM, then maps LLM applications onto the specific requirements and metrics each standard defines. A sympathetic reader would care because the integration offers a way to automate repetitive quality work without losing the structured compliance and process maturity that these frameworks enforce. Case studies and open-source examples are presented to show current feasibility, while sections on data privacy, model bias, and explainability discuss the governance needed to keep the benefits intact.

Core claim

The paper establishes that LLM-based applications can address specific requirements and metrics within each standard, allowing AI-driven solutions to augment traditional SQA approaches while maintaining compliance and process maturity. It does this by first covering the standards and LLM fundamentals, then exploring applications such as defect detection and documentation maintenance, and finally mapping those applications directly to the provisions in ISO/IEC 12207, CMMI, and the others. Empirical examples illustrate viability and the text outlines governance steps to handle associated risks.

What carries the argument

The mapping of LLM-based SQA applications such as requirement analysis, defect detection, and test generation onto the requirements and metrics specified by established quality standards.

If this is right

  • Requirement validation and compliance checks performed by LLMs can directly satisfy provisions in ISO/IEC 12207.
  • Automated defect detection and test generation can contribute to the quality metrics defined in ISO/IEC 25010 and TMM.
  • Documentation maintenance with LLMs supports ongoing compliance under ISO 9001 and ISO/IEC 90003.
  • Governance structures for bias and privacy can be layered onto existing maturity models like CMMI without lowering process levels.
  • Future adaptive learning in LLMs could enable standards themselves to evolve toward AI-inclusive quality practices.
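
The task-to-standard pairings in the bullets above can be held in a small lookup structure. The following is a minimal sketch, not the paper's own mapping: the dictionary `TASK_TO_STANDARDS` and the helper `tasks_by_standard` are hypothetical names, and the entries only transcribe the pairings listed here. They carry no clause-level detail from any of the standards.

```python
from collections import defaultdict

# Hypothetical lookup table: each LLM-supported SQA task is paired with the
# standards named for it in the bullet list above. No clause numbers are
# included because the summary does not give any.
TASK_TO_STANDARDS = {
    "requirement validation": ["ISO/IEC 12207"],
    "compliance checks": ["ISO/IEC 12207"],
    "defect detection": ["ISO/IEC 25010", "TMM"],
    "test generation": ["ISO/IEC 25010", "TMM"],
    "documentation maintenance": ["ISO 9001", "ISO/IEC 90003"],
    "bias and privacy governance": ["CMMI"],
}


def tasks_by_standard(mapping):
    """Invert the table so an auditor can ask which tasks touch a standard."""
    inverted = defaultdict(list)
    for task, standards in mapping.items():
        for standard in standards:
            inverted[standard].append(task)
    # Sort task names so the result is stable regardless of insertion order.
    return {standard: sorted(tasks) for standard, tasks in inverted.items()}


if __name__ == "__main__":
    for standard, tasks in sorted(tasks_by_standard(TASK_TO_STANDARDS).items()):
        print(f"{standard}: {', '.join(tasks)}")
```

Inverting the table gives the per-standard view an audit would likely start from. A real mapping would need clause or practice identifiers in place of bare standard names.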

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar mappings could be developed for other AI systems beyond current LLMs to extend the same compliance benefits.
  • Teams in regulated sectors might reduce the effort required to reach higher CMMI maturity levels by adopting these LLM-supported processes.
  • Standards organizations could be prompted to issue updated guidance that explicitly accounts for AI contributions to quality evidence.
  • Pilot implementations in safety-critical domains would provide concrete data on whether the governance proposals actually preserve audit outcomes.

Load-bearing premise

That case studies and governance practices are sufficient to show LLMs can be integrated without the challenges of bias, privacy, or explainability undermining compliance with the standards.

What would settle it

An industry audit or controlled project where LLM use in SQA tasks results in failure to satisfy a specific requirement or metric in CMMI or ISO/IEC 12207 even after applying the paper's recommended governance and auditing steps.

Figures

Figures reproduced from arXiv: 2505.13766 by Avinash Patil.

Figure 1: Number of papers published per year from 2023 to 2025, showing …
Figure 2: Distribution of dataset themes used in the surveyed literature. A …
Figure 3: Frequency of evaluation approaches used in the papers. Comparative …
Figure 5: Distribution of LLMs reported in the literature. GPT-4, GPT-3.5, and …
Figure 6: Distribution of prompting techniques employed across papers. Few …
Figure 7: Proposed Architecture of LLM-Enhanced Software Quality Assurance (SQA) Framework. The diagram illustrates how LLM-based components …

Abstract

Software Quality Assurance (SQA) is critical for delivering reliable, secure, and efficient software products. The Software Quality Assurance Process aims to provide assurance that work products and processes comply with predefined provisions and plans. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance existing SQA processes by automating tasks like requirement analysis, code review, test generation, and compliance checks. Simultaneously, established standards such as ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001/ISO/IEC 90003, CMMI, and TMM provide structured frameworks for ensuring robust quality practices. This paper surveys the intersection of LLM-based SQA methods and these recognized standards, highlighting how AI-driven solutions can augment traditional approaches while maintaining compliance and process maturity. We first review the foundational software quality standards and the technical fundamentals of LLMs in software engineering. Next, we explore various LLM-based SQA applications, including requirement validation, defect detection, test generation, and documentation maintenance. We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard. Empirical case studies and open-source initiatives demonstrate the practical viability of these methods. At the same time, discussions on challenges (e.g., data privacy, model bias, explainability) underscore the need for deliberate governance and auditing. Finally, we propose future directions encompassing adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards for AI-driven software quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript surveys the intersection of LLM-based software quality assurance (SQA) methods and established standards, including ISO/IEC 12207, ISO/IEC 25010, and CMMI. It reviews the foundational standards and the fundamentals of LLMs in software engineering, then explores applications such as requirement validation, defect detection, test generation, and documentation maintenance. These applications are mapped to the standards to show how LLMs address specific requirements and metrics, and empirical case studies and open-source initiatives are presented to demonstrate practical viability. The manuscript also discusses challenges such as data privacy, model bias, and explainability, along with the governance they require, and proposes future directions including adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards.

Significance. This survey could provide a useful blueprint for integrating AI tools into SQA processes while aiming to preserve compliance with recognized standards. By highlighting mappings and case studies, it may help bridge the gap between emerging LLM technologies and traditional quality frameworks, potentially guiding practitioners in adopting these methods responsibly if the evidence for compliance is strengthened.

major comments (2)
  1. [Mapping LLM Applications to Software Quality Frameworks] The central claim that LLM-based SQA applications can address specific requirements and metrics within each standard while maintaining compliance and process maturity rests on the mappings described. However, for frameworks like CMMI, which emphasize documented, repeatable processes and specific practices at each maturity level, the mappings (e.g., linking test generation to verification) are likely high-level without explicit discussion of how LLM outputs are integrated into defined processes, assessed for consistency, or subjected to auditing to avoid introducing variability. This issue is load-bearing for the claim of preserving process maturity.
  2. [Discussions on Challenges] The paper notes that challenges such as data privacy, model bias, and explainability can be managed through deliberate governance and auditing. Yet, no concrete mechanisms are outlined that would ensure these governance approaches satisfy the assurance, documentation, and audit requirements of standards like ISO 9001 or CMMI. Without this, the assertion that LLM integration augments traditional approaches while maintaining compliance lacks sufficient support from the described evidence.
minor comments (2)
  1. The abstract and structure would benefit from a summary table that explicitly lists LLM tasks, corresponding standard clauses or practices, and the proposed augmentation methods for quick reference.
  2. Ensure that all cited case studies are clearly linked back to specific standards and metrics to strengthen the empirical demonstration of viability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas where the manuscript's discussion of compliance and process maturity could be strengthened. We address each major comment point by point below, clarifying the scope of our survey while making targeted revisions to improve the rigor of the mappings and governance discussions.

  1. Referee: [Mapping LLM Applications to Software Quality Frameworks] The central claim that LLM-based SQA applications can address specific requirements and metrics within each standard while maintaining compliance and process maturity rests on the mappings described. However, for frameworks like CMMI, which emphasize documented, repeatable processes and specific practices at each maturity level, the mappings (e.g., linking test generation to verification) are likely high-level without explicit discussion of how LLM outputs are integrated into defined processes, assessed for consistency, or subjected to auditing to avoid introducing variability. This issue is load-bearing for the claim of preserving process maturity.

    Authors: We agree that the original mappings were primarily high-level and did not adequately address the integration of LLM outputs into repeatable processes, consistency assessment, or auditing requirements emphasized by CMMI. In the revised manuscript, we have expanded the relevant section to include a dedicated discussion on process integration. This addition outlines how LLM-generated artifacts (such as test cases and defect reports) can be incorporated into CMMI process areas like Verification and Validation through defined workflows that incorporate human oversight checkpoints, automated logging for traceability, and consistency checks against established baselines. We reference existing literature on AI-augmented maturity models to illustrate auditing approaches that mitigate variability. As a survey paper, our contribution synthesizes these approaches rather than introducing new empirical audits, but the revisions provide a more explicit blueprint for maintaining process maturity. revision: yes

  2. Referee: [Discussions on Challenges] The paper notes that challenges such as data privacy, model bias, and explainability can be managed through deliberate governance and auditing. Yet, no concrete mechanisms are outlined that would ensure these governance approaches satisfy the assurance, documentation, and audit requirements of standards like ISO 9001 or CMMI. Without this, the assertion that LLM integration augments traditional approaches while maintaining compliance lacks sufficient support from the described evidence.

    Authors: We acknowledge that the original discussion of challenges was insufficiently concrete regarding mechanisms that align with the documentation and audit requirements of standards such as ISO 9001 and CMMI. The revised manuscript adds a new subsection on governance frameworks that specifies mechanisms including: integration of LLM outputs into ISO 9001 document control via version-controlled repositories with mandatory review logs; use of bias detection and explainability reports that feed into CMMI's Process and Product Quality Assurance area; and alignment with emerging AI governance standards such as ISO/IEC 42001 for management system audits. These additions draw from synthesized best practices in the literature to demonstrate how governance can support compliance. While we cannot supply original empirical case studies of live audits in this survey, the expanded content strengthens the support for the claim by providing actionable outlines rather than general assertions. revision: partial
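
The mechanisms named in the two responses above (human oversight checkpoints, automated logging for traceability, consistency checks against established baselines, mandatory review logs) can be combined into one acceptance gate. The sketch below is illustrative and is not taken from the paper or its revision. `AuditLog`, `accept_artifact`, the artifact fields, and the event names are all hypothetical. The "baseline" is assumed to be a set of known requirement IDs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AuditLog:
    """Append-only record of what happened to each LLM-generated artifact."""
    entries: list = field(default_factory=list)

    def record(self, event: str, artifact_id: str, detail: str) -> None:
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "artifact_id": artifact_id,
            "detail": detail,
        })


def accept_artifact(artifact: dict, baseline_requirements: set,
                    reviewer: Optional[str], log: AuditLog) -> bool:
    """Admit an LLM-generated test case only if it traces to a baselined
    requirement and a named human reviewer has signed off."""
    artifact_id = artifact["id"]
    log.record("generated", artifact_id, "LLM output received")

    # Consistency check (assumed form): the artifact must reference a
    # requirement that exists in the agreed baseline.
    if artifact["requirement_id"] not in baseline_requirements:
        log.record("inconsistent", artifact_id,
                   f"unknown requirement {artifact['requirement_id']}")
        return False

    # Human oversight checkpoint: nothing is accepted without a reviewer.
    if reviewer is None:
        log.record("pending_review", artifact_id, "awaiting human sign-off")
        return False

    log.record("accepted", artifact_id, f"approved by {reviewer}")
    return True


if __name__ == "__main__":
    log = AuditLog()
    test_case = {"id": "TC-1", "requirement_id": "REQ-1", "body": "..."}
    print(accept_artifact(test_case, {"REQ-1"}, "alice", log))
    print([entry["event"] for entry in log.entries])
```

Every decision is logged, including rejections, so the log can double as the review record an auditor would ask for. Whether such a record satisfies a particular ISO 9001 or CMMI appraisal is the open question the referee raises, and this sketch does not settle it.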

Circularity Check

0 steps flagged

No circularity: survey of external standards and literature

The paper is a survey reviewing established software quality standards (ISO/IEC 12207, CMMI, ISO/IEC 25010, etc.) and LLM applications in SQA tasks such as requirement validation and test generation. It maps applications to frameworks and cites external empirical case studies and open-source initiatives for viability. No internal equations, fitted parameters, self-referential predictions, or load-bearing self-citations are present that reduce claims to the paper's own inputs by construction. All central assertions rely on external references rather than self-contained derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that synthesizes existing literature on LLMs in software engineering and established quality standards. It introduces no new free parameters, axioms, or invented entities; the central claim rests on the described review and mapping process.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance

    cs.SE 2026-05 conditional novelty 4.0

    Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.

Reference graph

Works this paper leans on

262 extracted references · 262 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    The making of cloud applications: An empirical study on software development for the cloud,

    J. Cito, P. Leitner, H. C. Gallet al., “The making of cloud applications: An empirical study on software development for the cloud,”IEEE Software, vol. 35, no. 1, pp. 50–57, 2018

  2. [2]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008

  3. [3]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, J. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, and G. e. a. Brockman, “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

  4. [4]

    Codebert: A pre-trained model for program- ming and natural languages,

    Z. Feng, D. Guoet al., “Codebert: A pre-trained model for program- ming and natural languages,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1536–1547

  5. [5]

    Large language models for software engineer- ing: Review and reflections,

    R. Poldrack and Others, “Large language models for software engineer- ing: Review and reflections,” arXiv preprint arXiv:2210.12345, 2022

  6. [6]

    Deeptest: Automated testing of deep-neural-network-driven autonomous cars,

    Y . Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of deep-neural-network-driven autonomous cars,” inProceedings of the 40th International Conference on Software Engineering, 2018, pp. 303– 314

  7. [7]

    Next-generation bug reporting: Enhancing de- velopment with ai automation,

    A. Patil and A. Jadon, “Next-generation bug reporting: Enhancing de- velopment with ai automation,” in2025 10th International Conference on Signal Processing and Communication (ICSC). IEEE, 2025, pp. 487–493

  8. [8]

    P. A. Laplante,What Every Engineer Should Know about Software Engineering. CRC Press, 2018

  9. [9]

    Ethical ai development: Mitigating bias in generative models,

    A. Jadon, “Ethical ai development: Mitigating bias in generative models,”Interplay of Artificial General Intelligence with Quantum Computing: Towards Sustainability, pp. 123–136, 2025

  10. [10]

    Iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes,

    “Iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes,” https://www.iso.org/standard/63712.html, 2017, accessed: 2025-03-31

  11. [11]

    Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models,

    “Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models,” https://www.iso.org/standard/35733.html, 2011, accessed: 2025-03-31

  12. [12]

    A Data Fusion Platform for Supporting Bridge Deck Condition Monitoring by Merging Aerial and Ground Inspection Imagery

    V . Garousi, K. Petersen, and B. Ozkan, “Industry-academia collabora- tions in software testing: experience and success stories from canada,” arXiv preprint arXiv:1904.04986, 2019

  13. [13]

    Iso/iec 5055:2021 information technology – software measurement – quality measure elements,

    “Iso/iec 5055:2021 information technology – software measurement – quality measure elements,” https://www.iso.org/standard/80649.html, 2021, accessed: 2025-03-31

  14. [14]

    Vulnerabilities, patches, and exploits in the wild: A case study of apache http server and nginx repositories,

    I. Pashchenko, H. Plate, and F. Massacci, “Vulnerabilities, patches, and exploits in the wild: A case study of apache http server and nginx repositories,” arXiv preprint arXiv:2108.01691, 2021

  15. [15]

    Iso 9001:2015 quality management systems – requirements,

    “Iso 9001:2015 quality management systems – requirements,” https://www.iso.org/standard/62085.html, 2015, accessed: 2025-03-31

  16. [16]

    Iso/iec 90003:2018 software engineering – guidelines for the application of iso 9001:2015 to computer software,

    “Iso/iec 90003:2018 software engineering – guidelines for the application of iso 9001:2015 to computer software,” https://www.iso.org/standard/53288.html, 2018, accessed: 2025-03-31

  17. [17]

    Cmmi v2.0,

    C. Institute, “Cmmi v2.0,” https://cmmiinstitute.com/cmmi/v2.0, 2018, accessed: 2025-03-31

  18. [18]

    Exploring software process improvement in agile teams through the lens of cmmi,

    B. Dingsør, N. B. Moe, and A. Øyvang, “Exploring software process improvement in agile teams through the lens of cmmi,”Journal of Software: Evolution and Process, vol. 31, no. 6, p. e2160, 2019

  19. [19]

    Burnstein,Practical Software Testing: A Process-Oriented Approach

    I. Burnstein,Practical Software Testing: A Process-Oriented Approach. Springer, 2003

  20. [20]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2019

  21. [21]

    Language Models are Few-Shot Learners

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020

  22. [22]

    CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

    R. Puri, D. Kung, G. Janssenet al., “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and gen- eration,” arXiv preprint arXiv:2109.00859, 2021

  23. [23]

    Systematic evaluation of large language models of code,

    F. F. Xu and Others, “Systematic evaluation of large language models of code,” arXiv preprint arXiv:2202.13169, 2022

  24. [24]

    Expectations vs. experience: Evaluating the usability of code generation tools powered by large language models,

    P. Vaithilingam and Others, “Expectations vs. experience: Evaluating the usability of code generation tools powered by large language models,” inProceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–14

  25. [25]

    Requirements engineering for ai: Opportunities and challenges,

    F. Dalpiazet al., “Requirements engineering for ai: Opportunities and challenges,”Requirements Engineering, vol. 24, no. 3, pp. 403–415, 2019

  26. [26]

    On the use of automated documentation generation in open-source projects: A preliminary study,

    L. Moreno and Others, “On the use of automated documentation generation in open-source projects: A preliminary study,”Empirical Software Engineering, vol. 25, no. 3, pp. 1880–1908, 2020

  27. [27]

    Using llms in software requirements specifications: an empirical evaluation,

    M. Krishna, B. Gaur, A. Verma, and P. Jalote, “Using llms in software requirements specifications: an empirical evaluation,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 475–483

  28. [28]

    Requirements are all you need: From requirements to code with llms,

    B. Wei, “Requirements are all you need: From requirements to code with llms,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 416–422

  29. [29]

    Advancing requirements engineering through generative ai: Assessing the role of llms,

    C. Arora, J. Grundy, and M. Abdelrazek, “Advancing requirements engineering through generative ai: Assessing the role of llms,” in Generative AI for Effective Software Development. Springer, 2024, pp. 129–148

  30. [30]

    Generating specifications from requirements documents for smart devices using large language models (llms),

    R. Lutze and K. Waldh ¨or, “Generating specifications from requirements documents for smart devices using large language models (llms),” in International Conference on Human-Computer Interaction. Springer, 2024, pp. 94–108

  31. [31]

    Leveraging llms for the quality assurance of software requirements,

    S. Lubos, A. Felfernig, T. N. T. Tran, D. Garber, M. El Mansi, S. P. Erdeniz, and V .-M. Le, “Leveraging llms for the quality assurance of software requirements,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 389–397

  32. [32]

    Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,

    J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, “Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,” inGenerative ai for effective software development. Springer, 2024, pp. 71–108

  33. [33]

    Requirements verification through the analysis of source code by large language models,

    J. O. Couder, D. Gomez, and O. Ochoa, “Requirements verification through the analysis of source code by large language models,” in SoutheastCon 2024. IEEE, 2024, pp. 75–80

  34. [34]

    Enhancing requirements engineering with large language models: From elicitation and classification to traceability, ambiguity management and api recommendation,

    V . Ocleppo, “Enhancing requirements engineering with large language models: From elicitation and classification to traceability, ambiguity management and api recommendation,” Ph.D. dissertation, Politecnico di Torino, 2025

  35. [35]

    Requirements are all you need: The final frontier for end-user software engineering,

    D. Robinson, C. Cabrera, A. D. Gordon, N. D. Lawrence, and L. Men- nen, “Requirements are all you need: The final frontier for end-user software engineering,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 5, pp. 1–22, 2025

  36. [36]

    Re- cover: Toward requirements generation from stakeholders’ conversa- tions,

    G. V oria, F. Casillo, C. Gravino, G. Catolino, and F. Palomba, “Re- cover: Toward requirements generation from stakeholders’ conversa- tions,”IEEE Transactions on Software Engineering, 2025

  37. [37]

    Cross-level requirements tracing based on large language models,

    C. Ge, T. Wang, X. Yang, and C. Treude, “Cross-level requirements tracing based on large language models,”IEEE Transactions on Soft- ware Engineering, 2025

  38. [38]

    Collaboration with generative ai to improve requirements change,

    Y . Kong, N. Zhang, Z. Duan, and B. Yu, “Collaboration with generative ai to improve requirements change,”Computer Standards & Interfaces, p. 104013, 2025

  39. [39]

    Using large language models for natural language processing tasks in requirements engineering: A systematic guideline,

    A. V ogelsang and J. Fischbach, “Using large language models for natural language processing tasks in requirements engineering: A systematic guideline,” inHandbook on Natural Language Processing for Requirements Engineering. Springer, 2025, pp. 435–456

  40. [40]

    Automated classification and iden- tification of non-functional requirements in agile-based requirements using pre-trained language models,

    A. Alhaizaey and M. Al-Mashari, “Automated classification and iden- tification of non-functional requirements in agile-based requirements using pre-trained language models,”IEEE Access, 2025

  41. [41]

    Natu- ral language processing for requirements traceability,

    J. L. Guo, J.-P. Stegh ¨ofer, A. V ogelsang, and J. Cleland-Huang, “Natu- ral language processing for requirements traceability,” inHandbook on Natural Language Processing for Requirements Engineering. Springer, 2025, pp. 89–116

  42. [42]

    Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,

    Y . Xu, F. Lin, J. Yang, N. Tsantaliset al., “Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,”arXiv preprint arXiv:2503.14340, 2025

  43. [43]

    Large language models (llms) for source code analysis: applications, models and datasets,

    H. Jelodar, M. Meymani, and R. Razavi-Far, “Large language models (llms) for source code analysis: applications, models and datasets,” arXiv preprint arXiv:2503.17502, 2025

  44. [44]

    An empirical study on the code refactoring capability of large language models,

    J. Cordeiro, S. Noei, and Y . Zou, “An empirical study on the code refactoring capability of large language models,”arXiv preprint arXiv:2411.02320, 2024

  45. [45]

    Leveraging llms to automate software architecture design from informal specifications,

    A. Tagliaferro, S. Corboe, and B. Guindani, “Leveraging llms to automate software architecture design from informal specifications,” in2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C). IEEE, 2025, pp. 291–299

  46. [46]

    Design pattern recognition: a study of large language models,

    S. K. Pandey, S. Chand, J. Horkoff, M. Staron, M. Ochodek, and D. Durisic, “Design pattern recognition: a study of large language models,”Empirical Software Engineering, vol. 30, no. 3, p. 69, 2025

  47. [47]

    Large language models for constructing and optimizing machine learning workflows: A survey,

    Y . Gu, H. You, J. Cao, M. Yu, H. Fan, and S. Qian, “Large language models for constructing and optimizing machine learning workflows: A survey,”ACM Transactions on Software Engineering and Methodology, 2025

  48. [48]

    Knowledge-based multi-agent framework for automated software architecture design,

    Y . Zhang, R. Li, P. Liang, W. Sun, and Y . Liu, “Knowledge-based multi-agent framework for automated software architecture design,” inProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, pp. 530–534

  49. [49]

    Assessing llms for front-end software architecture knowledge,

    L. P. F. Guerra and N. Ernst, “Assessing llms for front-end software architecture knowledge,” in2025 IEEE/ACM International Workshop on Designing Software (Designing). IEEE, 2025, pp. 6–10

  50. [50]

    Gen- erative ai meets cad: enhancing engineering design to manufacturing processes with large language models,

    A. Daareyni, A. Martikkala, H. Mokhtarian, and I. F. Ituarte, “Gen- erative ai meets cad: enhancing engineering design to manufacturing processes with large language models,”The International Journal of Advanced Manufacturing Technology, pp. 1–10, 2025

  51. [51]

    L. H. Cheung and G. Di Marco, “Composing conversational architec- ture by integrating large language model: From reactive to suggestive architecture through exploring the mathematical nature of the trans- former model,”Nexus Network Journal, vol. 27, no. 1, pp. 203–220, 2025

  52. [52]

    Sa-ds: A dataset for large language model-driven ai accelerator design generation,

    D. Vungarala, M. Nazzal, M. Morsali, C. Zhang, A. Ghosh, A. Khreishah, and S. Angizi, “Sa-ds: A dataset for large language model-driven ai accelerator design generation,” in2025 IEEE Interna- tional Symposium on Circuits and Systems (ISCAS). IEEE, 2025, pp. 1–4

  53. [53]

    Llm-based test-driven interactive code generation: User study and empirical evaluation,

    S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “Llm-based test-driven interactive code generation: User study and empirical evaluation,” IEEE Transactions on Software Engineering, 2024

  54. [54]

    Ai-powered code review with llms: Early results,

    Z. Rasheed, M. A. Sami, M. Waseem, K.-K. Kemell, X. Wang, A. Nguyen, K. Systä, and P. Abrahamsson, “Ai-powered code review with llms: Early results,” arXiv preprint arXiv:2404.18496, 2024

  55. [55]

    Em-assist: Safe automated extractmethod refactoring with llms,

    D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, A. Sokolov, T. Bryksin, and D. Dig, “Em-assist: Safe automated extractmethod refactoring with llms,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY, USA: Association for Computing Machinery, 2024, p. 582–...

  56. [56]

    Together we go further: Llms and ide static analysis for extract method refactoring,

    D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, T. Bryksin, and D. Dig, “Together we go further: Llms and ide static analysis for extract method refactoring,” 2024. [Online]. Available: https://arxiv.org/abs/2401.15298

  57. [57]

    ismell: Assembling llms with expert toolsets for code smell detection and refactoring,

    D. Wu, F. Mu, L. Shi, Z. Guo, K. Liu, W. Zhuang, Y. Zhong, and L. Zhang, “ismell: Assembling llms with expert toolsets for code smell detection and refactoring,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 1345–1357. [Online]...

  58. [58]

    C2hlsc: Leveraging large language models to bridge the software-to-hardware design gap,

    L. Collini, S. Garg, and R. Karri, “C2hlsc: Leveraging large language models to bridge the software-to-hardware design gap,” ACM Transactions on Design Automation of Electronic Systems, vol. 30, no. 6, pp. 1–24, 2025

  59. [59]

    Template-guided program repair in the era of large language models

    K. Huang, J. Zhang, X. Meng, and Y. Liu, “Template-guided program repair in the era of large language models,” in ICSE, 2025, pp. 1895–1907

  60. [60]

    Opencoder: The open cookbook for top-tier code large language models,

    S. Huang, T. Cheng, J. K. Liu, W. Xu, J. Hao, L. Song, Y. Xu, J. Yang, J. Liu, C. Zhang et al., “Opencoder: The open cookbook for top-tier code large language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 33 167–33 193

  61. [61]

    On the effectiveness of large language models in domain-specific code generation,

    X. Gu, M. Chen, Y. Lin, Y. Hu, H. Zhang, C. Wan, Z. Wei, Y. Xu, and J. Wang, “On the effectiveness of large language models in domain-specific code generation,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–22, 2025

  62. [62]

    Soleval: Benchmarking large language models for repository-level solidity code generation,

    Z. Peng, X. Yin, R. Qian, P. Lin, Y. Liu, H. Zhang, C. Ying, and Y. Luo, “Soleval: Benchmarking large language models for repository-level solidity code generation,” arXiv preprint arXiv:2502.18793, 2025

  63. [63]

    Fixing large language models’ specification misunderstanding for better code generation,

    Z. Tian, J. Chen, and X. Zhang, “Fixing large language models’ specification misunderstanding for better code generation,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2025, pp. 645–645

  64. [64]

    Exploring parameter-efficient fine-tuning techniques for code generation with large language models,

    M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui, “Exploring parameter-efficient fine-tuning techniques for code generation with large language models,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 7, pp. 1–25, 2025

  65. [65]

    Scalable, validated code translation of entire projects using large language models,

    H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, validated code translation of entire projects using large language models,” Proceedings of the ACM on Programming Languages, vol. 9, no. PLDI, pp. 1616–1641, 2025

  66. [66]

    Enhancing large language models for text-to-testcase generation,

    S. Alagarsamy, C. Tantithamthavorn, C. Arora, and A. Aleti, “Enhancing large language models for text-to-testcase generation,” arXiv preprint arXiv:2402.11910, 2024

  67. [67]

    Generating test scenarios from nl requirements using retrieval-augmented llms: An industrial study,

    C. Arora, T. Herda, and V. Homm, “Generating test scenarios from nl requirements using retrieval-augmented llms: An industrial study,” in 2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 240–251

  68. [68]

    Evaluating large language models for software testing,

    Y. Li, P. Liu, H. Wang, J. Chu, and W. E. Wong, “Evaluating large language models for software testing,” Computer Standards & Interfaces, vol. 93, p. 103942, 2025

  69. [69]

    A tool for test case scenarios generation using large language models,

    A. M. Sami, Z. Rasheed, M. Waseem, Z. Zhang, H. Tomas, and P. Abrahamsson, “A tool for test case scenarios generation using large language models,” arXiv preprint arXiv:2406.07021, 2024

  70. [70]

    An initial investigation of chatgpt unit test generation capability,

    V. Guilherme and A. Vincenzi, “An initial investigation of chatgpt unit test generation capability,” in Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, 2023, pp. 15–24

  71. [71]

    System test case design from requirements specifications: Insights and challenges of using chatgpt,

    S. Bhatia, T. Gandhi, D. Kumar, and P. Jalote, “System test case design from requirements specifications: Insights and challenges of using chatgpt,” arXiv preprint arXiv:2412.03693, 2024

  72. [72]

    Multi-language unit test generation using llms,

    R. Pan, M. Kim, R. Krishna, R. Pavuluri, and S. Sinha, “Multi-language unit test generation using llms,” arXiv preprint arXiv:2409.03093, 2024

  73. [73]

    Mutation-guided llm-based test generation at meta,

    C. Foster, A. Gulati, M. Harman, I. Harper, K. Mao, J. Ritchey, H. Robert, and S. Sengupta, “Mutation-guided llm-based test generation at meta,” arXiv preprint arXiv:2501.12862, 2025

  74. [74]

    Automated program refinement: Guide and verify code large language model with refinement calculus,

    Y. Cai, Z. Hou, D. Sanán, X. Luan, Y. Lin, J. Sun, and J. S. Dong, “Automated program refinement: Guide and verify code large language model with refinement calculus,” Proceedings of the ACM on Programming Languages, vol. 9, no. POPL, pp. 2057–2089, 2025

  75. [75]

    Exploring automated assertion generation via large language models,

    Q. Zhang, W. Sun, C. Fang, B. Yu, H. Li, M. Yan, J. Zhou, and Z. Chen, “Exploring automated assertion generation via large language models,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–25, 2025

  76. [76]

    Automating autograding: Large language models as test suite generators for introductory programming,

    U. Alkafaween, I. Albluwi, and P. Denny, “Automating autograding: Large language models as test suite generators for introductory programming,” Journal of Computer Assisted Learning, vol. 41, no. 1, p. e13100, 2025

  77. [77]

    Classinvgen: Class invariant synthesis using large language models,

    C. Sun, V. Agashe, S. Chakraborty, J. Taneja, C. Barrett, D. Dill, X. Qiu, and S. K. Lahiri, “Classinvgen: Class invariant synthesis using large language models,” in International Symposium on AI Verification. Springer, 2025, pp. 64–96

  78. [78]

    A large-scale empirical study on fine-tuning large language models for unit testing,

    Y. Shang, Q. Zhang, C. Fang, S. Gu, J. Zhou, and Z. Chen, “A large-scale empirical study on fine-tuning large language models for unit testing,” Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1678–1700, 2025

  79. [79]

    A system for automated unit test generation using large language models and assessment of generated test suites,

    A. Lops, F. Narducci, A. Ragone, M. Trizio, and C. Bartolini, “A system for automated unit test generation using large language models and assessment of generated test suites,” in 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2025, pp. 29–36

  80. [80]

    Testeval: Benchmarking large language models for test case generation,

    W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “Testeval: Benchmarking large language models for test case generation,” in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 3547–3562
