A Blueprint for AI-Driven Software Quality: Integrating LLMs with Established Standards

Avinash Patil

arXiv: 2505.13766 · v5 · submitted 2025-05-19 · cs.SE · cs.AI · cs.CL

Pith reviewed 2026-05-22 13:41 UTC · model grok-4.3

classification: cs.SE · cs.AI · cs.CL

keywords: large language models, software quality assurance, ISO/IEC standards, CMMI, process maturity, AI-driven quality, compliance mapping, test generation

The pith

Large language models can perform software quality assurance tasks while aligning with standards such as ISO/IEC 12207, ISO 9001, and CMMI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys the use of large language models to support software quality assurance processes including requirement validation, code review, test generation, and compliance verification. It reviews foundational standards like ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001, CMMI, and TMM, then maps LLM applications onto the specific requirements and metrics each standard defines. A sympathetic reader would care because the integration offers a way to automate repetitive quality work without losing the structured compliance and process maturity that these frameworks enforce. Case studies and open-source examples are presented to show current feasibility, while sections on data privacy, model bias, and explainability discuss the governance needed to keep the benefits intact.

Core claim

The paper establishes that LLM-based applications can address specific requirements and metrics within each standard, allowing AI-driven solutions to augment traditional SQA approaches while maintaining compliance and process maturity. It does this by first covering the standards and LLM fundamentals, then exploring applications such as defect detection and documentation maintenance, and finally mapping those applications directly to the provisions in ISO/IEC 12207, CMMI, and the others. Empirical examples illustrate viability and the text outlines governance steps to handle associated risks.

What carries the argument

The mapping of LLM-based SQA applications such as requirement analysis, defect detection, and test generation onto the requirements and metrics specified by established quality standards.

If this is right

Requirement validation and compliance checks performed by LLMs can directly satisfy provisions in ISO/IEC 12207.
Automated defect detection and test generation can contribute to the quality metrics defined in ISO/IEC 25010 and TMM.
Documentation maintenance with LLMs supports ongoing compliance under ISO 9001 and ISO/IEC 90003.
Governance structures for bias and privacy can be layered onto existing maturity models like CMMI without lowering process levels.
Future adaptive learning in LLMs could enable standards themselves to evolve toward AI-inclusive quality practices.
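
The pairings listed above can be held in a small lookup structure. The following is a minimal sketch in Python, assuming a flat task-to-standard table; the names `TASK_TO_STANDARDS`, `standards_for`, and `tasks_for` are hypothetical and do not come from the paper. The entries only restate the pairings above at the level of whole standards, since this summary does not give clause-level mappings.

```python
from typing import Dict, List

# Hypothetical lookup table: each LLM-supported SQA task is paired with the
# standards named for it in the list above. Clause numbers are deliberately
# omitted because this summary does not state them.
TASK_TO_STANDARDS: Dict[str, List[str]] = {
    "requirement validation": ["ISO/IEC 12207"],
    "compliance checks": ["ISO/IEC 12207"],
    "defect detection": ["ISO/IEC 25010", "TMM"],
    "test generation": ["ISO/IEC 25010", "TMM"],
    "documentation maintenance": ["ISO 9001", "ISO/IEC 90003"],
    "bias and privacy governance": ["CMMI"],
}


def standards_for(task: str) -> List[str]:
    """Return the standards paired with a task, or an empty list if unmapped."""
    return TASK_TO_STANDARDS.get(task.strip().lower(), [])


def tasks_for(standard: str) -> List[str]:
    """Invert the table: list every task paired with the given standard."""
    return sorted(t for t, stds in TASK_TO_STANDARDS.items() if standard in stds)
```

Keeping the table as plain data means an auditor can read it directly, and inverting it with `tasks_for` shows at a glance which standards currently have no LLM-supported task attached.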

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar mappings could be developed for other AI systems beyond current LLMs to extend the same compliance benefits.
Teams in regulated sectors might reduce the effort required to reach higher CMMI maturity levels by adopting these LLM-supported processes.
Standards organizations could be prompted to issue updated guidance that explicitly accounts for AI contributions to quality evidence.
Pilot implementations in safety-critical domains would provide concrete data on whether the governance proposals actually preserve audit outcomes.

Load-bearing premise

That case studies and governance practices are sufficient to show LLMs can be integrated without the challenges of bias, privacy, or explainability undermining compliance with the standards.

What would settle it

An industry audit or controlled project where LLM use in SQA tasks results in failure to satisfy a specific requirement or metric in CMMI or ISO/IEC 12207 even after applying the paper's recommended governance and auditing steps.

Figures

Figures reproduced from arXiv: 2505.13766 by Avinash Patil.

**Figure 1.** Number of papers published per year from 2023 to 2025, showing ...

**Figure 3.** Frequency of evaluation approaches used in the papers. Comparative ...

**Figure 6.** Distribution of prompting techniques employed across papers. Few ...

**Figure 5.** Distribution of LLMs reported in the literature. GPT-4, GPT-3.5, and ...

**Figure 7.** Proposed Architecture of LLM-Enhanced Software Quality Assurance (SQA) Framework. The diagram illustrates how LLM-based components ...

Original abstract

Software Quality Assurance (SQA) is critical for delivering reliable, secure, and efficient software products. The Software Quality Assurance Process aims to provide assurance that work products and processes comply with predefined provisions and plans. Recent advancements in Large Language Models (LLMs) present new opportunities to enhance existing SQA processes by automating tasks like requirement analysis, code review, test generation, and compliance checks. Simultaneously, established standards such as ISO/IEC 12207, ISO/IEC 25010, ISO/IEC 5055, ISO 9001/ISO/IEC 90003, CMMI, and TMM provide structured frameworks for ensuring robust quality practices. This paper surveys the intersection of LLM-based SQA methods and these recognized standards, highlighting how AI-driven solutions can augment traditional approaches while maintaining compliance and process maturity. We first review the foundational software quality standards and the technical fundamentals of LLMs in software engineering. Next, we explore various LLM-based SQA applications, including requirement validation, defect detection, test generation, and documentation maintenance. We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard. Empirical case studies and open-source initiatives demonstrate the practical viability of these methods. At the same time, discussions on challenges (e.g., data privacy, model bias, explainability) underscore the need for deliberate governance and auditing. Finally, we propose future directions encompassing adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards for AI-driven software quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is a survey that maps LLM tasks like test generation to standards such as CMMI and ISO/IEC 25010, but the mappings stay high-level and do not show how outputs would fit into auditable, repeatable processes.

The paper surveys LLM uses in software quality assurance and lines them up against established standards including ISO/IEC 12207, CMMI, ISO/IEC 25010, and TMM. It covers tasks such as requirement validation, defect detection, test generation, and documentation, then sketches how each might address metrics or practices in those frameworks. Empirical case studies and open-source work are cited to show viability, and challenges like bias and explainability are flagged with a call for governance.

Referee Report

2 major / 2 minor

Summary. The manuscript surveys the intersection of LLM-based software quality assurance (SQA) methods and established standards including ISO/IEC 12207, ISO/IEC 25010, CMMI, and others. It reviews the foundational standards and LLM fundamentals in software engineering, then explores applications such as requirement validation, defect detection, test generation, and documentation maintenance. These applications are mapped to the standards to show how LLMs address specific requirements and metrics, and empirical case studies and open-source initiatives are presented to demonstrate practical viability. The manuscript also discusses challenges such as data privacy, model bias, and explainability, along with the resulting need for governance, and proposes future directions including adaptive learning, privacy-focused deployments, multimodal analysis, and evolving standards.

Significance. This survey could provide a useful blueprint for integrating AI tools into SQA processes while aiming to preserve compliance with recognized standards. By highlighting mappings and case studies, it may help bridge the gap between emerging LLM technologies and traditional quality frameworks, potentially guiding practitioners in adopting these methods responsibly if the evidence for compliance is strengthened.

major comments (2)

[Mapping LLM Applications to Software Quality Frameworks] The central claim that LLM-based SQA applications can address specific requirements and metrics within each standard while maintaining compliance and process maturity rests on the mappings described. However, for frameworks like CMMI, which emphasize documented, repeatable processes and specific practices at each maturity level, the mappings (e.g., linking test generation to verification) are likely high-level without explicit discussion of how LLM outputs are integrated into defined processes, assessed for consistency, or subjected to auditing to avoid introducing variability. This issue is load-bearing for the claim of preserving process maturity.
[Discussions on Challenges] The paper notes that challenges such as data privacy, model bias, and explainability can be managed through deliberate governance and auditing. Yet, no concrete mechanisms are outlined that would ensure these governance approaches satisfy the assurance, documentation, and audit requirements of standards like ISO 9001 or CMMI. Without this, the assertion that LLM integration augments traditional approaches while maintaining compliance lacks sufficient support from the described evidence.
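
One way to make "assessed for consistency" concrete is to regenerate the same artifact several times and measure how much the outputs agree. The sketch below is an illustration under stated assumptions, not a procedure taken from the paper: it treats each LLM run as a set of test-case identifiers and reports the mean pairwise Jaccard similarity. The function names and the 0.8 threshold are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: List[Set[str]]) -> float:
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def is_consistent(runs: List[Set[str]], threshold: float = 0.8) -> bool:
    """Hypothetical acceptance rule: agreement must reach the threshold."""
    return mean_pairwise_jaccard(runs) >= threshold
```

A score recorded per generation task would give an assessor a repeatable number to audit, although set overlap is only one possible notion of consistency and says nothing about whether the generated tests are correct.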

minor comments (2)

The abstract and structure would benefit from a summary table that explicitly lists LLM tasks, corresponding standard clauses or practices, and the proposed augmentation methods for quick reference.
Ensure that all cited case studies are clearly linked back to specific standards and metrics to strengthen the empirical demonstration of viability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas where the manuscript's discussion of compliance and process maturity could be strengthened. We address each major comment point by point below, clarifying the scope of our survey while making targeted revisions to improve the rigor of the mappings and governance discussions.

Referee: [Mapping LLM Applications to Software Quality Frameworks] The central claim that LLM-based SQA applications can address specific requirements and metrics within each standard while maintaining compliance and process maturity rests on the mappings described. However, for frameworks like CMMI, which emphasize documented, repeatable processes and specific practices at each maturity level, the mappings (e.g., linking test generation to verification) are likely high-level without explicit discussion of how LLM outputs are integrated into defined processes, assessed for consistency, or subjected to auditing to avoid introducing variability. This issue is load-bearing for the claim of preserving process maturity.

Authors: We agree that the original mappings were primarily high-level and did not adequately address the integration of LLM outputs into repeatable processes, consistency assessment, or auditing requirements emphasized by CMMI. In the revised manuscript, we have expanded the relevant section to include a dedicated discussion on process integration. This addition outlines how LLM-generated artifacts (such as test cases and defect reports) can be incorporated into CMMI process areas such as Verification and Validation through defined workflows with human oversight checkpoints, automated logging for traceability, and consistency checks against established baselines. We reference existing literature on AI-augmented maturity models to illustrate auditing approaches that mitigate variability. As a survey paper, our contribution synthesizes these approaches rather than introducing new empirical audits, but the revisions provide a more explicit blueprint for maintaining process maturity.
Revision: yes

Referee: [Discussions on Challenges] The paper notes that challenges such as data privacy, model bias, and explainability can be managed through deliberate governance and auditing. Yet, no concrete mechanisms are outlined that would ensure these governance approaches satisfy the assurance, documentation, and audit requirements of standards like ISO 9001 or CMMI. Without this, the assertion that LLM integration augments traditional approaches while maintaining compliance lacks sufficient support from the described evidence.

Authors: We acknowledge that the original discussion of challenges was insufficiently concrete regarding mechanisms that align with the documentation and audit requirements of standards such as ISO 9001 and CMMI. The revised manuscript adds a new subsection on governance frameworks that specifies mechanisms including: integration of LLM outputs into ISO 9001 document control via version-controlled repositories with mandatory review logs; use of bias detection and explainability reports that feed into CMMI's Process and Product Quality Assurance area; and alignment with emerging AI governance standards such as ISO/IEC 42001 for management system audits. These additions draw from synthesized best practices in the literature to demonstrate how governance can support compliance. While we cannot supply original empirical case studies of live audits in this survey, the expanded content strengthens the support for the claim by providing actionable outlines rather than general assertions.
Revision: partial
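
The human oversight checkpoint and mandatory review log described in these responses could look roughly like the following. This is a minimal sketch under assumptions: `ArtifactRecord`, `ReviewLog`, and `releasable` are hypothetical names, the log lives in memory rather than in a version-controlled repository, and nothing here is taken from the revised manuscript.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArtifactRecord:
    """An LLM-generated artifact (e.g. a test case) awaiting human review."""
    artifact_id: str
    content: str
    model: str
    reviewer: Optional[str] = None
    approved: bool = False
    sha256: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # The hash ties each log entry to the exact content that was reviewed.
        self.sha256 = hashlib.sha256(self.content.encode("utf-8")).hexdigest()


class ReviewLog:
    """Append-only log: entries are serialized on write and never mutated."""

    def __init__(self) -> None:
        self._entries: List[str] = []

    def _append(self, event: str, record: ArtifactRecord) -> None:
        entry = {"event": event, **asdict(record)}
        entry.pop("content")  # store the hash only, not the artifact body
        self._entries.append(json.dumps(entry, sort_keys=True))

    def submit(self, record: ArtifactRecord) -> None:
        self._append("submitted", record)

    def approve(self, record: ArtifactRecord, reviewer: str) -> None:
        if not reviewer:
            raise ValueError("a named reviewer is required")
        record.reviewer = reviewer
        record.approved = True
        self._append("approved", record)

    def entries(self) -> List[dict]:
        return [json.loads(e) for e in self._entries]


def releasable(record: ArtifactRecord) -> bool:
    """Checkpoint: nothing is released without a named human approval."""
    return record.approved and record.reviewer is not None
```

Serializing each entry at write time means later changes to a record cannot silently rewrite history, which is the property an audit trail needs; a real deployment would persist the entries somewhere tamper-evident.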

Circularity Check

0 steps flagged

No circularity: survey of external standards and literature

The paper is a survey reviewing established software quality standards (ISO/IEC 12207, CMMI, ISO 25010, etc.) and LLM applications in SQA tasks such as requirement validation and test generation. It maps applications to frameworks and cites external empirical case studies and open-source initiatives for viability. No internal equations, fitted parameters, self-referential predictions, or load-bearing self-citations are present that reduce claims to the paper's own inputs by construction. All central assertions rely on external references rather than self-contained derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that synthesizes existing literature on LLMs in software engineering and established quality standards. It introduces no new free parameters, axioms, or invented entities; the central claim rests on the described review and mapping process.

pith-pipeline@v0.9.0 · 5799 in / 1173 out tokens · 50641 ms · 2026-05-22T13:41:57.000946+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We then map these applications to key software quality frameworks, illustrating how LLMs can address specific requirements and metrics within each standard."

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Empirical case studies and open-source initiatives demonstrate the practical viability of these methods."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
cs.SE · 2026-05 · conditional · novelty 4.0

Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.

Reference graph

Works this paper leans on

262 extracted references · 262 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

The making of cloud applications: An empirical study on software development for the cloud,

J. Cito, P. Leitner, H. C. Gallet al., “The making of cloud applications: An empirical study on software development for the cloud,”IEEE Software, vol. 35, no. 1, pp. 50–57, 2018

work page 2018
[2]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008

work page 2017
[3]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, J. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, and G. e. a. Brockman, “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Codebert: A pre-trained model for program- ming and natural languages,

Z. Feng, D. Guoet al., “Codebert: A pre-trained model for program- ming and natural languages,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1536–1547

work page 2020
[5]

Large language models for software engineer- ing: Review and reflections,

R. Poldrack and Others, “Large language models for software engineer- ing: Review and reflections,” arXiv preprint arXiv:2210.12345, 2022

work page arXiv 2022
[6]

Deeptest: Automated testing of deep-neural-network-driven autonomous cars,

Y . Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of deep-neural-network-driven autonomous cars,” inProceedings of the 40th International Conference on Software Engineering, 2018, pp. 303– 314

work page 2018
[7]

Next-generation bug reporting: Enhancing de- velopment with ai automation,

A. Patil and A. Jadon, “Next-generation bug reporting: Enhancing de- velopment with ai automation,” in2025 10th International Conference on Signal Processing and Communication (ICSC). IEEE, 2025, pp. 487–493

work page 2025
[8]

P. A. Laplante,What Every Engineer Should Know about Software Engineering. CRC Press, 2018

work page 2018
[9]

Ethical ai development: Mitigating bias in generative models,

A. Jadon, “Ethical ai development: Mitigating bias in generative models,”Interplay of Artificial General Intelligence with Quantum Computing: Towards Sustainability, pp. 123–136, 2025

work page 2025
[10]

Iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes,

“Iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes,” https://www.iso.org/standard/63712.html, 2017, accessed: 2025-03-31

work page 2017
[11]

Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models,

“Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models,” https://www.iso.org/standard/35733.html, 2011, accessed: 2025-03-31

work page 2011
[12]

A Data Fusion Platform for Supporting Bridge Deck Condition Monitoring by Merging Aerial and Ground Inspection Imagery

V . Garousi, K. Petersen, and B. Ozkan, “Industry-academia collabora- tions in software testing: experience and success stories from canada,” arXiv preprint arXiv:1904.04986, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[13]

Iso/iec 5055:2021 information technology – software measurement – quality measure elements,

“Iso/iec 5055:2021 information technology – software measurement – quality measure elements,” https://www.iso.org/standard/80649.html, 2021, accessed: 2025-03-31

work page 2021
[14]

Vulnerabilities, patches, and exploits in the wild: A case study of apache http server and nginx repositories,

I. Pashchenko, H. Plate, and F. Massacci, “Vulnerabilities, patches, and exploits in the wild: A case study of apache http server and nginx repositories,” arXiv preprint arXiv:2108.01691, 2021

work page arXiv 2021
[15]

Iso 9001:2015 quality management systems – requirements,

“Iso 9001:2015 quality management systems – requirements,” https://www.iso.org/standard/62085.html, 2015, accessed: 2025-03-31

work page 2015
[16]

Iso/iec 90003:2018 software engineering – guidelines for the application of iso 9001:2015 to computer software,

“Iso/iec 90003:2018 software engineering – guidelines for the application of iso 9001:2015 to computer software,” https://www.iso.org/standard/53288.html, 2018, accessed: 2025-03-31

work page 2018
[17]

Cmmi v2.0,

C. Institute, “Cmmi v2.0,” https://cmmiinstitute.com/cmmi/v2.0, 2018, accessed: 2025-03-31

work page 2018
[18]

Exploring software process improvement in agile teams through the lens of cmmi,

B. Dingsør, N. B. Moe, and A. Øyvang, “Exploring software process improvement in agile teams through the lens of cmmi,”Journal of Software: Evolution and Process, vol. 31, no. 6, p. e2160, 2019

work page 2019
[19]

Burnstein,Practical Software Testing: A Process-Oriented Approach

I. Burnstein,Practical Software Testing: A Process-Oriented Approach. Springer, 2003

work page 2003
[20]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[21]

Language Models are Few-Shot Learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[22]

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

R. Puri, D. Kung, G. Janssenet al., “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and gen- eration,” arXiv preprint arXiv:2109.00859, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Systematic evaluation of large language models of code,

F. F. Xu and Others, “Systematic evaluation of large language models of code,” arXiv preprint arXiv:2202.13169, 2022

work page arXiv 2022
[24]

Expectations vs. experience: Evaluating the usability of code generation tools powered by large language models,

P. Vaithilingam and Others, “Expectations vs. experience: Evaluating the usability of code generation tools powered by large language models,” inProceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–14

work page 2022
[25]

Requirements engineering for ai: Opportunities and challenges,

F. Dalpiazet al., “Requirements engineering for ai: Opportunities and challenges,”Requirements Engineering, vol. 24, no. 3, pp. 403–415, 2019

work page 2019
[26]

On the use of automated documentation generation in open-source projects: A preliminary study,

L. Moreno and Others, “On the use of automated documentation generation in open-source projects: A preliminary study,”Empirical Software Engineering, vol. 25, no. 3, pp. 1880–1908, 2020

work page 1908
[27]

Using llms in software requirements specifications: an empirical evaluation,

M. Krishna, B. Gaur, A. Verma, and P. Jalote, “Using llms in software requirements specifications: an empirical evaluation,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 475–483

work page 2024
[28]

Requirements are all you need: From requirements to code with llms,

B. Wei, “Requirements are all you need: From requirements to code with llms,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 416–422

work page 2024
[29]

Advancing requirements engineering through generative ai: Assessing the role of llms,

C. Arora, J. Grundy, and M. Abdelrazek, “Advancing requirements engineering through generative ai: Assessing the role of llms,” in Generative AI for Effective Software Development. Springer, 2024, pp. 129–148

work page 2024
[30]

Generating specifications from requirements documents for smart devices using large language models (llms),

R. Lutze and K. Waldh ¨or, “Generating specifications from requirements documents for smart devices using large language models (llms),” in International Conference on Human-Computer Interaction. Springer, 2024, pp. 94–108

work page 2024
[31]

Leveraging llms for the quality assurance of software requirements,

S. Lubos, A. Felfernig, T. N. T. Tran, D. Garber, M. El Mansi, S. P. Erdeniz, and V .-M. Le, “Leveraging llms for the quality assurance of software requirements,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 389–397

work page 2024
[32]

Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,

J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, “Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,” inGenerative ai for effective software development. Springer, 2024, pp. 71–108

work page 2024
[33]

Requirements verification through the analysis of source code by large language models,

J. O. Couder, D. Gomez, and O. Ochoa, “Requirements verification through the analysis of source code by large language models,” in SoutheastCon 2024. IEEE, 2024, pp. 75–80

work page 2024
[34]

Enhancing requirements engineering with large language models: From elicitation and classification to traceability, ambiguity management and api recommendation,

V . Ocleppo, “Enhancing requirements engineering with large language models: From elicitation and classification to traceability, ambiguity management and api recommendation,” Ph.D. dissertation, Politecnico di Torino, 2025

work page 2025
[35]

Requirements are all you need: The final frontier for end-user software engineering,

D. Robinson, C. Cabrera, A. D. Gordon, N. D. Lawrence, and L. Men- nen, “Requirements are all you need: The final frontier for end-user software engineering,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 5, pp. 1–22, 2025

work page 2025
[36]

Re- cover: Toward requirements generation from stakeholders’ conversa- tions,

G. V oria, F. Casillo, C. Gravino, G. Catolino, and F. Palomba, “Re- cover: Toward requirements generation from stakeholders’ conversa- tions,”IEEE Transactions on Software Engineering, 2025

work page 2025
[37]

Cross-level requirements tracing based on large language models,

C. Ge, T. Wang, X. Yang, and C. Treude, “Cross-level requirements tracing based on large language models,”IEEE Transactions on Soft- ware Engineering, 2025

work page 2025
[38]

Collaboration with generative ai to improve requirements change,

Y . Kong, N. Zhang, Z. Duan, and B. Yu, “Collaboration with generative ai to improve requirements change,”Computer Standards & Interfaces, p. 104013, 2025

work page 2025
[39]

Using large language models for natural language processing tasks in requirements engineering: A systematic guideline,

A. V ogelsang and J. Fischbach, “Using large language models for natural language processing tasks in requirements engineering: A systematic guideline,” inHandbook on Natural Language Processing for Requirements Engineering. Springer, 2025, pp. 435–456

work page 2025
[40]

Automated classification and iden- tification of non-functional requirements in agile-based requirements using pre-trained language models,

A. Alhaizaey and M. Al-Mashari, “Automated classification and iden- tification of non-functional requirements in agile-based requirements using pre-trained language models,”IEEE Access, 2025

work page 2025
[41]

Natu- ral language processing for requirements traceability,

J. L. Guo, J.-P. Stegh ¨ofer, A. V ogelsang, and J. Cleland-Huang, “Natu- ral language processing for requirements traceability,” inHandbook on Natural Language Processing for Requirements Engineering. Springer, 2025, pp. 89–116

work page 2025
[42]

Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,

Y . Xu, F. Lin, J. Yang, N. Tsantaliset al., “Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,”arXiv preprint arXiv:2503.14340, 2025

work page arXiv 2025
[43]

Large language models (llms) for source code analysis: applications, models and datasets,

H. Jelodar, M. Meymani, and R. Razavi-Far, “Large language models (llms) for source code analysis: applications, models and datasets,” arXiv preprint arXiv:2503.17502, 2025

work page arXiv 2025
[44]

An empirical study on the code refactoring capability of large language models,

J. Cordeiro, S. Noei, and Y . Zou, “An empirical study on the code refactoring capability of large language models,”arXiv preprint arXiv:2411.02320, 2024

work page arXiv 2024
[45]

Leveraging llms to automate software architecture design from informal specifications,

A. Tagliaferro, S. Corboe, and B. Guindani, “Leveraging llms to automate software architecture design from informal specifications,” in2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C). IEEE, 2025, pp. 291–299

work page 2025
[46]

Design pattern recognition: a study of large language models,

S. K. Pandey, S. Chand, J. Horkoff, M. Staron, M. Ochodek, and D. Durisic, “Design pattern recognition: a study of large language models,”Empirical Software Engineering, vol. 30, no. 3, p. 69, 2025

work page 2025
[47]

Large language models for constructing and optimizing machine learning workflows: A survey,

Y . Gu, H. You, J. Cao, M. Yu, H. Fan, and S. Qian, “Large language models for constructing and optimizing machine learning workflows: A survey,”ACM Transactions on Software Engineering and Methodology, 2025

work page 2025
[48]

Y. Zhang, R. Li, P. Liang, W. Sun, and Y. Liu, “Knowledge-based multi-agent framework for automated software architecture design,” in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, pp. 530–534
[49]

L. P. F. Guerra and N. Ernst, “Assessing llms for front-end software architecture knowledge,” in 2025 IEEE/ACM International Workshop on Designing Software (Designing). IEEE, 2025, pp. 6–10
[50]

A. Daareyni, A. Martikkala, H. Mokhtarian, and I. F. Ituarte, “Generative ai meets cad: enhancing engineering design to manufacturing processes with large language models,” The International Journal of Advanced Manufacturing Technology, pp. 1–10, 2025
[51]

L. H. Cheung and G. Di Marco, “Composing conversational architecture by integrating large language model: From reactive to suggestive architecture through exploring the mathematical nature of the transformer model,” Nexus Network Journal, vol. 27, no. 1, pp. 203–220, 2025
[52]

D. Vungarala, M. Nazzal, M. Morsali, C. Zhang, A. Ghosh, A. Khreishah, and S. Angizi, “Sa-ds: A dataset for large language model-driven ai accelerator design generation,” in 2025 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2025, pp. 1–4
[53]

S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “Llm-based test-driven interactive code generation: User study and empirical evaluation,” IEEE Transactions on Software Engineering, 2024
[54]

Z. Rasheed, M. A. Sami, M. Waseem, K.-K. Kemell, X. Wang, A. Nguyen, K. Systä, and P. Abrahamsson, “Ai-powered code review with llms: Early results,” arXiv preprint arXiv:2404.18496, 2024
[55]

D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, A. Sokolov, T. Bryksin, and D. Dig, “Em-assist: Safe automated extractmethod refactoring with llms,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY, USA: Association for Computing Machinery, 2024, p. 582–...

doi:10.1145/3663529.3663803
[56]

D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, T. Bryksin, and D. Dig, “Together we go further: Llms and ide static analysis for extract method refactoring,” 2024. [Online]. Available: https://arxiv.org/abs/2401.15298
[57]

D. Wu, F. Mu, L. Shi, Z. Guo, K. Liu, W. Zhuang, Y. Zhong, and L. Zhang, “ismell: Assembling llms with expert toolsets for code smell detection and refactoring,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’24. New York, NY, USA: Association for Computing Machinery, 2024, p. 1345–1357. [Online]...

doi:10.1145/3691620.3695508
[58]

L. Collini, S. Garg, and R. Karri, “C2hlsc: Leveraging large language models to bridge the software-to-hardware design gap,” ACM Transactions on Design Automation of Electronic Systems, vol. 30, no. 6, pp. 1–24, 2025
[59]

K. Huang, J. Zhang, X. Meng, and Y. Liu, “Template-guided program repair in the era of large language models,” in ICSE, 2025, pp. 1895–1907
[60]

S. Huang, T. Cheng, J. K. Liu, W. Xu, J. Hao, L. Song, Y. Xu, J. Yang, J. Liu, C. Zhang et al., “Opencoder: The open cookbook for top-tier code large language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 33 167–33 193
[61]

X. Gu, M. Chen, Y. Lin, Y. Hu, H. Zhang, C. Wan, Z. Wei, Y. Xu, and J. Wang, “On the effectiveness of large language models in domain-specific code generation,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–22, 2025
[62]

Z. Peng, X. Yin, R. Qian, P. Lin, Y. Liu, H. Zhang, C. Ying, and Y. Luo, “Soleval: Benchmarking large language models for repository-level solidity code generation,” arXiv preprint arXiv:2502.18793, 2025
[63]

Z. Tian, J. Chen, and X. Zhang, “Fixing large language models’ specification misunderstanding for better code generation,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2025, pp. 645–645
[64]

M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui, “Exploring parameter-efficient fine-tuning techniques for code generation with large language models,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 7, pp. 1–25, 2025
[65]

H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, validated code translation of entire projects using large language models,” Proceedings of the ACM on Programming Languages, vol. 9, no. PLDI, pp. 1616–1641, 2025
[66]

S. Alagarsamy, C. Tantithamthavorn, C. Arora, and A. Aleti, “Enhancing large language models for text-to-testcase generation,” arXiv preprint arXiv:2402.11910, 2024
[67]

C. Arora, T. Herda, and V. Homm, “Generating test scenarios from nl requirements using retrieval-augmented llms: An industrial study,” in 2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 240–251
[68]

Y. Li, P. Liu, H. Wang, J. Chu, and W. E. Wong, “Evaluating large language models for software testing,” Computer Standards & Interfaces, vol. 93, p. 103942, 2025
[69]

A. M. Sami, Z. Rasheed, M. Waseem, Z. Zhang, H. Tomas, and P. Abrahamsson, “A tool for test case scenarios generation using large language models,” arXiv preprint arXiv:2406.07021, 2024
[70]

V. Guilherme and A. Vincenzi, “An initial investigation of chatgpt unit test generation capability,” in Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, 2023, pp. 15–24
[71]

S. Bhatia, T. Gandhi, D. Kumar, and P. Jalote, “System test case design from requirements specifications: Insights and challenges of using chatgpt,” arXiv preprint arXiv:2412.03693, 2024
[72]

R. Pan, M. Kim, R. Krishna, R. Pavuluri, and S. Sinha, “Multi-language unit test generation using llms,” arXiv preprint arXiv:2409.03093, 2024
[73]

C. Foster, A. Gulati, M. Harman, I. Harper, K. Mao, J. Ritchey, H. Robert, and S. Sengupta, “Mutation-guided llm-based test generation at meta,” arXiv preprint arXiv:2501.12862, 2025
[74]

Y. Cai, Z. Hou, D. Sanán, X. Luan, Y. Lin, J. Sun, and J. S. Dong, “Automated program refinement: Guide and verify code large language model with refinement calculus,” Proceedings of the ACM on Programming Languages, vol. 9, no. POPL, pp. 2057–2089, 2025
[75]

Q. Zhang, W. Sun, C. Fang, B. Yu, H. Li, M. Yan, J. Zhou, and Z. Chen, “Exploring automated assertion generation via large language models,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–25, 2025
[76]

U. Alkafaween, I. Albluwi, and P. Denny, “Automating autograding: Large language models as test suite generators for introductory programming,” Journal of Computer Assisted Learning, vol. 41, no. 1, p. e13100, 2025
[77]

C. Sun, V. Agashe, S. Chakraborty, J. Taneja, C. Barrett, D. Dill, X. Qiu, and S. K. Lahiri, “Classinvgen: Class invariant synthesis using large language models,” in International Symposium on AI Verification. Springer, 2025, pp. 64–96
[78]

Y. Shang, Q. Zhang, C. Fang, S. Gu, J. Zhou, and Z. Chen, “A large-scale empirical study on fine-tuning large language models for unit testing,” Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1678–1700, 2025
[79]

A. Lops, F. Narducci, A. Ragone, M. Trizio, and C. Bartolini, “A system for automated unit test generation using large language models and assessment of generated test suites,” in 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2025, pp. 29–36
[80]

W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “Testeval: Benchmarking large language models for test case generation,” in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 3547–3562

Showing first 80 references.

[1] [1]

The making of cloud applications: An empirical study on software development for the cloud,

J. Cito, P. Leitner, H. C. Gallet al., “The making of cloud applications: An empirical study on software development for the cloud,”IEEE Software, vol. 35, no. 1, pp. 50–57, 2018

work page 2018

[2] [2]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008

work page 2017

[3] [3]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, J. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, and G. e. a. Brockman, “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Codebert: A pre-trained model for program- ming and natural languages,

Z. Feng, D. Guoet al., “Codebert: A pre-trained model for program- ming and natural languages,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1536–1547

work page 2020

[5] [5]

Large language models for software engineer- ing: Review and reflections,

R. Poldrack and Others, “Large language models for software engineer- ing: Review and reflections,” arXiv preprint arXiv:2210.12345, 2022

work page arXiv 2022

[6] [6]

Deeptest: Automated testing of deep-neural-network-driven autonomous cars,

Y . Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of deep-neural-network-driven autonomous cars,” inProceedings of the 40th International Conference on Software Engineering, 2018, pp. 303– 314

work page 2018

[7] [7]

Next-generation bug reporting: Enhancing de- velopment with ai automation,

A. Patil and A. Jadon, “Next-generation bug reporting: Enhancing de- velopment with ai automation,” in2025 10th International Conference on Signal Processing and Communication (ICSC). IEEE, 2025, pp. 487–493

work page 2025

[8] [8]

P. A. Laplante,What Every Engineer Should Know about Software Engineering. CRC Press, 2018

work page 2018

[9] [9]

Ethical ai development: Mitigating bias in generative models,

A. Jadon, “Ethical ai development: Mitigating bias in generative models,”Interplay of Artificial General Intelligence with Quantum Computing: Towards Sustainability, pp. 123–136, 2025

work page 2025

[10] [10]

Iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes,

“Iso/iec/ieee 12207:2017 systems and software engineering – software life cycle processes,” https://www.iso.org/standard/63712.html, 2017, accessed: 2025-03-31

work page 2017

[11] [11]

Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models,

“Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models,” https://www.iso.org/standard/35733.html, 2011, accessed: 2025-03-31

work page 2011

[12] [12]

A Data Fusion Platform for Supporting Bridge Deck Condition Monitoring by Merging Aerial and Ground Inspection Imagery

V . Garousi, K. Petersen, and B. Ozkan, “Industry-academia collabora- tions in software testing: experience and success stories from canada,” arXiv preprint arXiv:1904.04986, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904

[13] [13]

Iso/iec 5055:2021 information technology – software measurement – quality measure elements,

“Iso/iec 5055:2021 information technology – software measurement – quality measure elements,” https://www.iso.org/standard/80649.html, 2021, accessed: 2025-03-31

work page 2021

[14] [14]

Vulnerabilities, patches, and exploits in the wild: A case study of apache http server and nginx repositories,

I. Pashchenko, H. Plate, and F. Massacci, “Vulnerabilities, patches, and exploits in the wild: A case study of apache http server and nginx repositories,” arXiv preprint arXiv:2108.01691, 2021

work page arXiv 2021

[15] [15]

Iso 9001:2015 quality management systems – requirements,

“Iso 9001:2015 quality management systems – requirements,” https://www.iso.org/standard/62085.html, 2015, accessed: 2025-03-31

work page 2015

[16] [16]

Iso/iec 90003:2018 software engineering – guidelines for the application of iso 9001:2015 to computer software,

“Iso/iec 90003:2018 software engineering – guidelines for the application of iso 9001:2015 to computer software,” https://www.iso.org/standard/53288.html, 2018, accessed: 2025-03-31

work page 2018

[17] [17]

Cmmi v2.0,

C. Institute, “Cmmi v2.0,” https://cmmiinstitute.com/cmmi/v2.0, 2018, accessed: 2025-03-31

work page 2018

[18] [18]

Exploring software process improvement in agile teams through the lens of cmmi,

B. Dingsør, N. B. Moe, and A. Øyvang, “Exploring software process improvement in agile teams through the lens of cmmi,”Journal of Software: Evolution and Process, vol. 31, no. 6, p. e2160, 2019

work page 2019

[19] [19]

Burnstein,Practical Software Testing: A Process-Oriented Approach

I. Burnstein,Practical Software Testing: A Process-Oriented Approach. Springer, 2003

work page 2003

[20] [20]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019

[21] [21]

Language Models are Few-Shot Learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askellet al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[22] [22]

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

R. Puri, D. Kung, G. Janssenet al., “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and gen- eration,” arXiv preprint arXiv:2109.00859, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[23] [23]

Systematic evaluation of large language models of code,

F. F. Xu and Others, “Systematic evaluation of large language models of code,” arXiv preprint arXiv:2202.13169, 2022

work page arXiv 2022

[24] [24]

Expectations vs. experience: Evaluating the usability of code generation tools powered by large language models,

P. Vaithilingam and Others, “Expectations vs. experience: Evaluating the usability of code generation tools powered by large language models,” inProceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–14

work page 2022

[25] [25]

Requirements engineering for ai: Opportunities and challenges,

F. Dalpiazet al., “Requirements engineering for ai: Opportunities and challenges,”Requirements Engineering, vol. 24, no. 3, pp. 403–415, 2019

work page 2019

[26] [26]

On the use of automated documentation generation in open-source projects: A preliminary study,

L. Moreno and Others, “On the use of automated documentation generation in open-source projects: A preliminary study,”Empirical Software Engineering, vol. 25, no. 3, pp. 1880–1908, 2020

work page 1908

[27] [27]

Using llms in software requirements specifications: an empirical evaluation,

M. Krishna, B. Gaur, A. Verma, and P. Jalote, “Using llms in software requirements specifications: an empirical evaluation,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 475–483

work page 2024

[28] [28]

Requirements are all you need: From requirements to code with llms,

B. Wei, “Requirements are all you need: From requirements to code with llms,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 416–422

work page 2024

[29] [29]

Advancing requirements engineering through generative ai: Assessing the role of llms,

C. Arora, J. Grundy, and M. Abdelrazek, “Advancing requirements engineering through generative ai: Assessing the role of llms,” in Generative AI for Effective Software Development. Springer, 2024, pp. 129–148

work page 2024

[30] [30]

Generating specifications from requirements documents for smart devices using large language models (llms),

R. Lutze and K. Waldh ¨or, “Generating specifications from requirements documents for smart devices using large language models (llms),” in International Conference on Human-Computer Interaction. Springer, 2024, pp. 94–108

work page 2024

[31] [31]

Leveraging llms for the quality assurance of software requirements,

S. Lubos, A. Felfernig, T. N. T. Tran, D. Garber, M. El Mansi, S. P. Erdeniz, and V .-M. Le, “Leveraging llms for the quality assurance of software requirements,” in2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 389–397

work page 2024

[32] [32]

Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,

J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, “Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,” inGenerative ai for effective software development. Springer, 2024, pp. 71–108

work page 2024

[33] [33]

Requirements verification through the analysis of source code by large language models,

J. O. Couder, D. Gomez, and O. Ochoa, “Requirements verification through the analysis of source code by large language models,” in SoutheastCon 2024. IEEE, 2024, pp. 75–80

work page 2024

[34] [34]

Enhancing requirements engineering with large language models: From elicitation and classification to traceability, ambiguity management and api recommendation,

V . Ocleppo, “Enhancing requirements engineering with large language models: From elicitation and classification to traceability, ambiguity management and api recommendation,” Ph.D. dissertation, Politecnico di Torino, 2025

work page 2025

[35] [35]

Requirements are all you need: The final frontier for end-user software engineering,

D. Robinson, C. Cabrera, A. D. Gordon, N. D. Lawrence, and L. Men- nen, “Requirements are all you need: The final frontier for end-user software engineering,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 5, pp. 1–22, 2025

work page 2025

[36] [36]

Re- cover: Toward requirements generation from stakeholders’ conversa- tions,

G. V oria, F. Casillo, C. Gravino, G. Catolino, and F. Palomba, “Re- cover: Toward requirements generation from stakeholders’ conversa- tions,”IEEE Transactions on Software Engineering, 2025

work page 2025

[37] [37]

Cross-level requirements tracing based on large language models,

C. Ge, T. Wang, X. Yang, and C. Treude, “Cross-level requirements tracing based on large language models,”IEEE Transactions on Soft- ware Engineering, 2025

work page 2025

[38] [38]

Collaboration with generative ai to improve requirements change,

Y . Kong, N. Zhang, Z. Duan, and B. Yu, “Collaboration with generative ai to improve requirements change,”Computer Standards & Interfaces, p. 104013, 2025

work page 2025

[39] [39]

Using large language models for natural language processing tasks in requirements engineering: A systematic guideline,

A. V ogelsang and J. Fischbach, “Using large language models for natural language processing tasks in requirements engineering: A systematic guideline,” inHandbook on Natural Language Processing for Requirements Engineering. Springer, 2025, pp. 435–456

work page 2025

[40] [40]

Automated classification and iden- tification of non-functional requirements in agile-based requirements using pre-trained language models,

A. Alhaizaey and M. Al-Mashari, “Automated classification and iden- tification of non-functional requirements in agile-based requirements using pre-trained language models,”IEEE Access, 2025

work page 2025

[41] [41]

Natu- ral language processing for requirements traceability,

J. L. Guo, J.-P. Stegh ¨ofer, A. V ogelsang, and J. Cleland-Huang, “Natu- ral language processing for requirements traceability,” inHandbook on Natural Language Processing for Requirements Engineering. Springer, 2025, pp. 89–116

work page 2025

[42] [42]

Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,

Y . Xu, F. Lin, J. Yang, N. Tsantaliset al., “Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,”arXiv preprint arXiv:2503.14340, 2025

work page arXiv 2025

[43] [43]

Large language models (llms) for source code analysis: applications, models and datasets,

H. Jelodar, M. Meymani, and R. Razavi-Far, “Large language models (llms) for source code analysis: applications, models and datasets,” arXiv preprint arXiv:2503.17502, 2025

work page arXiv 2025

[44] [44]

An empirical study on the code refactoring capability of large language models,

J. Cordeiro, S. Noei, and Y . Zou, “An empirical study on the code refactoring capability of large language models,”arXiv preprint arXiv:2411.02320, 2024

work page arXiv 2024

[45] [45]

Leveraging llms to automate software architecture design from informal specifications,

A. Tagliaferro, S. Corboe, and B. Guindani, “Leveraging llms to automate software architecture design from informal specifications,” in2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C). IEEE, 2025, pp. 291–299

work page 2025

[46] [46]

Design pattern recognition: a study of large language models,

S. K. Pandey, S. Chand, J. Horkoff, M. Staron, M. Ochodek, and D. Durisic, “Design pattern recognition: a study of large language models,”Empirical Software Engineering, vol. 30, no. 3, p. 69, 2025

work page 2025

[47] [47]

Large language models for constructing and optimizing machine learning workflows: A survey,

Y . Gu, H. You, J. Cao, M. Yu, H. Fan, and S. Qian, “Large language models for constructing and optimizing machine learning workflows: A survey,”ACM Transactions on Software Engineering and Methodology, 2025

work page 2025

[48] [48]

Knowledge-based multi-agent framework for automated software architecture design,

Y . Zhang, R. Li, P. Liang, W. Sun, and Y . Liu, “Knowledge-based multi-agent framework for automated software architecture design,” inProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, pp. 530–534

work page 2025

[49] [49]

Assessing llms for front-end software architecture knowledge,

L. P. F. Guerra and N. Ernst, “Assessing llms for front-end software architecture knowledge,” in2025 IEEE/ACM International Workshop on Designing Software (Designing). IEEE, 2025, pp. 6–10

work page 2025

[50] [50]

Gen- erative ai meets cad: enhancing engineering design to manufacturing processes with large language models,

A. Daareyni, A. Martikkala, H. Mokhtarian, and I. F. Ituarte, “Gen- erative ai meets cad: enhancing engineering design to manufacturing processes with large language models,”The International Journal of Advanced Manufacturing Technology, pp. 1–10, 2025

work page 2025

[51] [51]

L. H. Cheung and G. Di Marco, “Composing conversational architec- ture by integrating large language model: From reactive to suggestive architecture through exploring the mathematical nature of the trans- former model,”Nexus Network Journal, vol. 27, no. 1, pp. 203–220, 2025

work page 2025

[52] [52]

Sa-ds: A dataset for large language model-driven ai accelerator design generation,

D. Vungarala, M. Nazzal, M. Morsali, C. Zhang, A. Ghosh, A. Khreishah, and S. Angizi, “Sa-ds: A dataset for large language model-driven ai accelerator design generation,” in2025 IEEE Interna- tional Symposium on Circuits and Systems (ISCAS). IEEE, 2025, pp. 1–4

work page 2025

[53] [53]

Llm-based test-driven interactive code generation: User study and empirical evaluation,

S. Fakhoury, A. Naik, G. Sakkas, S. Chakraborty, and S. K. Lahiri, “Llm-based test-driven interactive code generation: User study and empirical evaluation,”IEEE Transactions on Software Engineering, 2024

work page 2024

[54] [54]

Ai-powered code review with llms: Early results,

Z. Rasheed, M. A. Sami, M. Waseem, K.-K. Kemell, X. Wang, A. Nguyen, K. Syst ¨a, and P. Abrahamsson, “Ai-powered code review with llms: Early results,”arXiv preprint arXiv:2404.18496, 2024

work page arXiv 2024

[55] [55]

Em-assist: Safe automated extractmethod refactoring with llms,

D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, A. Sokolov, T. Bryksin, and D. Dig, “Em-assist: Safe automated extractmethod refactoring with llms,” inCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY , USA: Association for Computing Machinery, 2024, p. 582–...

work page doi:10.1145/3663529.3663803 2024

[56] [56]

Together we go further: Llms and ide static analysis for extract method refactoring,

D. Pomian, A. Bellur, M. Dilhara, Z. Kurbatova, E. Bogomolov, T. Bryksin, and D. Dig, “Together we go further: Llms and ide static analysis for extract method refactoring,” 2024. [Online]. Available: https://arxiv.org/abs/2401.15298

work page arXiv 2024

[57] [57]

ismell: Assembling llms with expert toolsets for code smell detection and refactoring,

D. Wu, F. Mu, L. Shi, Z. Guo, K. Liu, W. Zhuang, Y . Zhong, and L. Zhang, “ismell: Assembling llms with expert toolsets for code smell detection and refactoring,” inProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 1345–1357. [Online]...

work page doi:10.1145/3691620.3695508 2024

[58] [58]

C2hlsc: Leveraging large language models to bridge the software-to-hardware design gap,

L. Collini, S. Garg, and R. Karri, “C2hlsc: Leveraging large language models to bridge the software-to-hardware design gap,”ACM Transac- tions on Design Automation of Electronic Systems, vol. 30, no. 6, pp. 1–24, 2025

work page 2025

[59] [59]

Template-guided program repair in the era of large language models

K. Huang, J. Zhang, X. Meng, and Y . Liu, “Template-guided program repair in the era of large language models.” inICSE, 2025, pp. 1895– 1907

work page 2025

[60] [60]

Opencoder: The open cookbook for top- tier code large language models,

S. Huang, T. Cheng, J. K. Liu, W. Xu, J. Hao, L. Song, Y . Xu, J. Yang, J. Liu, C. Zhanget al., “Opencoder: The open cookbook for top- tier code large language models,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 33 167–33 193

work page 2025

[61] [61]

On the effectiveness of large language models in domain- specific code generation,

X. Gu, M. Chen, Y . Lin, Y . Hu, H. Zhang, C. Wan, Z. Wei, Y . Xu, and J. Wang, “On the effectiveness of large language models in domain- specific code generation,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–22, 2025

work page 2025

[62] [62]

Soleval: Benchmarking large language models for repository- level solidity code generation,

Z. Peng, X. Yin, R. Qian, P. Lin, Y . Liu, H. Zhang, C. Ying, and Y . Luo, “Soleval: Benchmarking large language models for repository- level solidity code generation,”arXiv preprint arXiv:2502.18793, 2025

work page arXiv 2025

[63] [63]

Fixing large language models’ specification misunderstanding for better code generation,

Z. Tian, J. Chen, and X. Zhang, “Fixing large language models’ specification misunderstanding for better code generation,” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 2025, pp. 645–645

work page 2025

[64] [64]

Exploring parameter-efficient fine-tuning techniques for code generation with large language models,

M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui, “Exploring parameter-efficient fine-tuning techniques for code generation with large language models,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 7, pp. 1–25, 2025

work page 2025

[65] [65]

Scalable, validated code translation of entire projects using large language models,

H. Zhang, C. David, M. Wang, B. Paulsen, and D. Kroening, “Scalable, validated code translation of entire projects using large language models,”Proceedings of the ACM on Programming Languages, vol. 9, no. PLDI, pp. 1616–1641, 2025

work page 2025

[66] [66]

En- hancing large language models for text-to-testcase generation,

S. Alagarsamy, C. Tantithamthavorn, C. Arora, and A. Aleti, “En- hancing large language models for text-to-testcase generation,”arXiv preprint arXiv:2402.11910, 2024

work page arXiv 2024

[67] C. Arora, T. Herda, and V. Homm, “Generating test scenarios from NL requirements using retrieval-augmented LLMs: An industrial study,” in 2024 IEEE 32nd International Requirements Engineering Conference (RE). IEEE, 2024, pp. 240–251.

[68] Y. Li, P. Liu, H. Wang, J. Chu, and W. E. Wong, “Evaluating large language models for software testing,” Computer Standards & Interfaces, vol. 93, p. 103942, 2025.

[69] A. M. Sami, Z. Rasheed, M. Waseem, Z. Zhang, H. Tomas, and P. Abrahamsson, “A tool for test case scenarios generation using large language models,” arXiv preprint arXiv:2406.07021, 2024.

[70] V. Guilherme and A. Vincenzi, “An initial investigation of ChatGPT unit test generation capability,” in Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, 2023, pp. 15–24.

[71] S. Bhatia, T. Gandhi, D. Kumar, and P. Jalote, “System test case design from requirements specifications: Insights and challenges of using ChatGPT,” arXiv preprint arXiv:2412.03693, 2024.

[72] R. Pan, M. Kim, R. Krishna, R. Pavuluri, and S. Sinha, “Multi-language unit test generation using LLMs,” arXiv preprint arXiv:2409.03093, 2024.

[73] C. Foster, A. Gulati, M. Harman, I. Harper, K. Mao, J. Ritchey, H. Robert, and S. Sengupta, “Mutation-guided LLM-based test generation at Meta,” arXiv preprint arXiv:2501.12862, 2025.

[74] Y. Cai, Z. Hou, D. Sanán, X. Luan, Y. Lin, J. Sun, and J. S. Dong, “Automated program refinement: Guide and verify code large language model with refinement calculus,” Proceedings of the ACM on Programming Languages, vol. 9, no. POPL, pp. 2057–2089, 2025.

[75] Q. Zhang, W. Sun, C. Fang, B. Yu, H. Li, M. Yan, J. Zhou, and Z. Chen, “Exploring automated assertion generation via large language models,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 3, pp. 1–25, 2025.

[76] U. Alkafaween, I. Albluwi, and P. Denny, “Automating autograding: Large language models as test suite generators for introductory programming,” Journal of Computer Assisted Learning, vol. 41, no. 1, p. e13100, 2025.

[77] C. Sun, V. Agashe, S. Chakraborty, J. Taneja, C. Barrett, D. Dill, X. Qiu, and S. K. Lahiri, “ClassInvGen: Class invariant synthesis using large language models,” in International Symposium on AI Verification. Springer, 2025, pp. 64–96.

[78] Y. Shang, Q. Zhang, C. Fang, S. Gu, J. Zhou, and Z. Chen, “A large-scale empirical study on fine-tuning large language models for unit testing,” Proceedings of the ACM on Software Engineering, vol. 2, no. ISSTA, pp. 1678–1700, 2025.

[79] A. Lops, F. Narducci, A. Ragone, M. Trizio, and C. Bartolini, “A system for automated unit test generation using large language models and assessment of generated test suites,” in 2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2025, pp. 29–36.

[80] W. Wang, C. Yang, Z. Wang, Y. Huang, Z. Chu, D. Song, L. Zhang, A. R. Chen, and L. Ma, “TestEval: Benchmarking large language models for test case generation,” in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 3547–3562.