pith. machine review for the scientific record.

arxiv: 2604.03135 · v1 · submitted 2026-04-03 · 💻 cs.SE · cs.AI

Recognition: no theorem link

AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

Ema Smolic, Luka Hobor, Mario Brcic, Mihael Kovac

Authors on Pith no claims yet

Pith reviewed 2026-05-13 18:45 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-assisted testing · unit test generation · test-driven refactoring · code refactoring · case study · legacy code · regression risk

The pith

AI coding models generate nearly 16,000 lines of unit tests in hours to enable safe refactoring of legacy code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a case study in which coding models first generate unit tests that capture the behavior of an existing MVP-derived software system. These tests then serve as a validation layer for model-proposed refactoring changes performed under developer oversight. The workflow produced a large test suite rapidly, reached substantial branch coverage in key modules, and lowered the chance of regressions during major code modifications. The study also notes observed model errors, the need for manual corrections at times, and steps taken to address value misalignment between model outputs and intended system goals. This illustrates a shift toward empirical, data-constrained software engineering practices.
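To make the workflow concrete, here is a minimal sketch of what such a behavior-capturing (characterization) test might look like, assuming a Python/pytest codebase purely for illustration; the module `legacy_pricing`, the function `apply_discount`, and the expected values are hypothetical, not taken from the paper.

```python
# Characterization-test sketch: pin down the *current* behavior of a legacy
# function so that later refactoring can be checked against it.
# The module, function, and expected values below are hypothetical.
import pytest

from legacy_pricing import apply_discount  # hypothetical legacy module


@pytest.mark.parametrize(
    "price, tier, expected",
    [
        (100.0, "gold", 85.0),    # outputs observed from the current implementation,
        (100.0, "silver", 92.5),  # recorded before any refactoring begins
        (0.0, "gold", 0.0),
    ],
)
def test_apply_discount_matches_current_behavior(price, tier, expected):
    # These assertions encode observed behavior, not a specification:
    # if the original code has a bug, the test preserves that bug too.
    assert apply_discount(price, tier) == pytest.approx(expected)


def test_unknown_tier_raises():
    # Edge-case behavior captured as-is from the existing implementation.
    with pytest.raises(KeyError):
        apply_discount(50.0, "platinum")
```

A suite of such tests acts as the validation layer: any refactoring that changes what the recorded calls return makes at least one test fail.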

Core claim

Iterative AI unit test generation can capture existing system behavior well enough to constrain and validate subsequent model-assisted refactoring, yielding nearly 16,000 lines of tests in hours rather than weeks and up to 78% branch coverage in critical modules while reducing regression risk.

What carries the argument

Iteratively generated AI unit tests that validate and constrain supervised model-assisted refactoring changes.

Load-bearing premise

AI-generated tests capture enough of the system's intended behavior that passing them reliably confirms refactoring safety without missing important bugs.

What would settle it

A refactoring that passes all generated AI tests yet produces incorrect behavior in actual usage or additional manual tests.
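Short of that, one can probe for such blind spots with a fault-injection (mutation-style) check: deliberately break behavior and confirm that the generated suite notices. A hedged sketch follows, reusing the hypothetical `legacy_pricing` module from the example above; dedicated mutation-testing tools such as mutmut automate this across many injected faults.

```python
# Fault-injection probe sketch: swap in a deliberately wrong implementation and
# check that the generated test suite fails. If the suite stays green, it has a
# blind spot of exactly the kind described above. Names are hypothetical.
import pytest

import legacy_pricing  # hypothetical module under test


def broken_apply_discount(price, tier):
    # Injected fault: the customer tier is ignored entirely.
    return price * 0.85


def suite_detects_injected_fault() -> bool:
    original = legacy_pricing.apply_discount
    legacy_pricing.apply_discount = broken_apply_discount  # inject the fault
    try:
        outcome = pytest.main(["-q", "tests/generated/test_legacy_pricing.py"])
    finally:
        legacy_pricing.apply_discount = original  # restore the real implementation
    return int(outcome) != 0  # non-zero exit code means the fault was caught


if __name__ == "__main__":
    if not suite_detects_injected_fault():
        raise SystemExit("Injected fault went undetected: suite has a coverage gap.")
```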

Figures

Figures reproduced from arXiv: 2604.03135 by Ema Smolic, Luka Hobor, Mario Brcic, Mihael Kovac.

Figure 1
Figure 1: AI-led test generation flow. The plan lived in a markdown file and contained information on 1) which parts of the codebase should be tested in the next step, 2) how specific test suites should be structured, and 3) which conventions to follow (alongside the rules in context files). It was updated iteratively after the previous step was implemented and reviewed by a human reviewer. … view at source ↗
Figure 3
Figure 3: LOC difference pre- and post-refactor. … view at source ↗
read the original abstract

Many software systems originate as prototypes or minimum viable products (MVPs), developed with an emphasis on delivery speed and responsiveness to changing requirements rather than long-term code maintainability. While effective for rapid delivery, this approach can result in codebases that are difficult to modify, presenting a significant opportunity cost in the era of AI-assisted or even AI-led programming. In this paper, we present a case study of using coding models for automated unit test generation and subsequent safe refactoring, with proposed code changes validated by passing tests. The study examines best practices for iteratively generating tests to capture existing system behavior, followed by model-assisted refactoring under developer supervision. We describe how this workflow constrained refactoring changes, the errors and limitations observed in both phases, the efficiency gains achieved, when manual intervention was necessary, and how we addressed the weak value misalignment we observed in models. Using this approach, we generated nearly 16,000 lines of reliable unit tests in hours rather than weeks, achieved up to 78% branch coverage in critical modules, and significantly reduced regression risk during large-scale refactoring. These results illustrate software engineering's shift toward an empirical science, emphasizing data collection and constraining mechanisms that support fast, safe iteration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a single-case study of using AI coding models to iteratively generate unit tests for a legacy codebase (nearly 16,000 lines, up to 78% branch coverage in critical modules) and then apply test-driven refactoring under developer supervision, claiming substantial time savings (hours vs. weeks) and reduced regression risk.

Significance. If the central safety claim holds, the work supplies a concrete, reproducible workflow for AI-assisted maintenance of MVP-derived codebases and illustrates measurable efficiency gains that could accelerate empirical practices in software engineering.

major comments (2)
  1. [Abstract and Results] The headline claim that the generated tests are 'reliable' and 'significantly reduced regression risk' rests on the unverified premise that tests derived from the existing implementation capture intended behavior rather than current (possibly buggy) behavior. No fault-injection experiments, comparison to human-written test suites, or post-refactor manual audit are reported, so the safety assertion cannot be separated from the completeness assumption.
  2. [Methodology] The workflow begins by prompting on the current code rather than an external specification or requirements document. This creates a circularity risk that any gaps or defects in the original implementation are mirrored in the test suite; the paper does not quantify or mitigate this beyond general developer supervision.
minor comments (1)
  1. [Abstract] The term 'weak value misalignment' is used in the abstract without a precise definition or example in the provided summary; a short clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our case study. We address each major comment below with the strongest honest defense possible, proposing targeted revisions to clarify claims and limitations without overstating the results.

read point-by-point responses
  1. Referee: [Abstract and Results] The headline claim that the generated tests are 'reliable' and 'significantly reduced regression risk' rests on the unverified premise that tests derived from the existing implementation capture intended behavior rather than current (possibly buggy) behavior. No fault-injection experiments, comparison to human-written test suites, or post-refactor manual audit are reported, so the safety assertion cannot be separated from the completeness assumption.

    Authors: We agree the distinction is important and that our original wording could be read as claiming verification of intended behavior. The case study explicitly targets capture of observed behavior in a legacy MVP codebase to support regression-safe refactoring; the tests are reliable insofar as they pass on the current implementation and flag behavioral changes during edits. However, absent fault injection, human-written test comparisons, or post-refactor audits, we cannot claim the suite would have caught pre-existing defects. We will revise the abstract and results to replace 'reliable unit tests' and 'significantly reduced regression risk' with phrasing that specifies 'tests capturing observed behavior' and 'reduced risk of unintended behavioral changes during refactoring,' and we will add an explicit limitations paragraph on this point. revision: partial

  2. Referee: [Methodology] The workflow begins by prompting on the current code rather than an external specification or requirements document. This creates a circularity risk that any gaps or defects in the original implementation are mirrored in the test suite; the paper does not quantify or mitigate this beyond general developer supervision.

    Authors: This accurately describes the practical constraint of the setting: the codebase originated as an MVP with no maintained external specification. Prompting from current code is the only feasible entry point for reverse-engineering behavior in such legacy systems. Mitigation occurs via iterative developer review of generated tests before they are used to gate refactoring changes. While we cannot quantify the residual risk without unavailable ground-truth specifications, we will expand the methodology section to detail the specific supervision steps (e.g., manual inspection of tests for core modules, rejection of tests that fail to exercise key paths) and add a dedicated paragraph acknowledging the circularity risk as an inherent limitation of implementation-driven test generation for legacy code. revision: partial
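To make the proposed supervision steps concrete, here is a hedged sketch of one possible gate in a Python/pytest setting: a candidate AI-generated test module is accepted only if it passes and actually exercises a designated set of key functions. The module, function names, and paths are illustrative and are not the authors' tooling.

```python
# Supervision-gate sketch: accept an AI-generated test module only if it passes
# and exercises designated key paths. Module, function names, and paths are
# illustrative; this is not the tooling described in the paper.
import functools

import pytest

import legacy_pricing  # hypothetical module being refactored

KEY_FUNCTIONS = ["apply_discount", "close_invoice"]  # hypothetical critical paths


def gate_generated_tests(test_file: str) -> bool:
    exercised: set = set()

    def recording(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            exercised.add(name)
            return fn(*args, **kwargs)
        return wrapper

    originals = {name: getattr(legacy_pricing, name) for name in KEY_FUNCTIONS}
    for name, fn in originals.items():
        setattr(legacy_pricing, name, recording(name, fn))  # wrap to record calls
    try:
        passed = int(pytest.main(["-q", test_file])) == 0
    finally:
        for name, fn in originals.items():
            setattr(legacy_pricing, name, fn)  # restore originals

    return passed and exercised.issuperset(KEY_FUNCTIONS)


if __name__ == "__main__":
    if not gate_generated_tests("tests/generated/test_legacy_pricing.py"):
        raise SystemExit("Rejected: candidate tests fail or miss key paths.")
```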

Circularity Check

0 steps flagged

No circularity: empirical case study with direct observations only

full rationale

The paper is a descriptive case study of an AI-assisted workflow for unit test generation and refactoring. It reports concrete empirical outcomes (16,000 lines of tests generated, up to 78% branch coverage) without any equations, fitted parameters, predictions, or derivations. No self-citations are used as load-bearing premises, no ansatzes are smuggled, and no results are renamed or defined in terms of themselves. All claims reduce to direct observation of the single case rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of current coding models at producing behavior-capturing tests and on the standard assumption that passing tests validate behavioral equivalence during refactoring.

axioms (2)
  • domain assumption AI coding models can iteratively produce tests that capture existing system behavior when guided by developer feedback
    Invoked as the basis for the success of the test-generation phase
  • standard math Tests that pass after a code change confirm that the change preserves original behavior
    Standard test-driven development premise used to validate refactoring
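The second assumption can be made concrete as a golden-master check: record the legacy implementation's outputs over a fixed input grid before refactoring, then require the refactored code to reproduce them. A minimal sketch, again using the hypothetical `legacy_pricing` module; note that equivalence is only confirmed on the sampled inputs, which is exactly the limit the referee raises.

```python
# Golden-master sketch of the "passing tests confirm preserved behavior" premise:
# snapshot outputs of the current implementation, then compare after refactoring.
# Equivalence is only checked on the sampled inputs; names are hypothetical.
import itertools
import json

from legacy_pricing import apply_discount  # hypothetical function under refactor

INPUT_GRID = list(itertools.product([0.0, 10.0, 99.99, 100.0], ["gold", "silver"]))


def snapshot(path: str = "golden_outputs.json") -> None:
    # Run once on the pre-refactor code to record observed behavior.
    golden = {f"{price}|{tier}": apply_discount(price, tier) for price, tier in INPUT_GRID}
    with open(path, "w") as fh:
        json.dump(golden, fh, indent=2)


def matches_snapshot(path: str = "golden_outputs.json") -> bool:
    # Run after refactoring: every sampled call must reproduce the recorded value.
    with open(path) as fh:
        golden = json.load(fh)
    return all(
        apply_discount(price, tier) == golden[f"{price}|{tier}"]
        for price, tier in INPUT_GRID
    )
```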

pith-pipeline@v0.9.0 · 5519 in / 1314 out tokens · 29629 ms · 2026-05-13T18:45:58.562663+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Ai-assisted code generation and optimization in .net web development,

    A. S. Shethiya, “Ai-assisted code generation and optimization in .net web development,” Annals of Applied Sciences, vol. 6, no. 1, Jan. 2025. [Online]. Available: https://annalsofappliedsciences.com/index.php/aas/article/view/15

  2. [2]

    Fowler, Refactoring: improving the design of existing code

    M. Fowler, Refactoring: improving the design of existing code. Addison-Wesley Professional, 2018

  3. [3]

    An empirical evaluation of using large language models for automated unit test generation,

    M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2024

  4. [4]

    Test Intention Guided LLM-Based Unit Test Generation,

    Z. Nan, Z. Guo, K. Liu, and X. Xia, “Test Intention Guided LLM-Based Unit Test Generation,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). Los Alamitos, CA, USA: IEEE Computer Society, May 2025, pp. 1026–1038. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICSE55347.2025.00243

  5. [5]

    Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,

    G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, “Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,” Proc. ACM Softw. Eng., vol. 1, no. FSE, Jul. 2024. [Online]. Available: https://doi.org/10.1145/3643769

  6. [6]

    Chatunitest: A framework for llm-based test generation,

    Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY, USA: Association for Computing Machinery, 2024, pp. 572–576. [Online]. Available: https://doi.org/10.1145/3...

  7. [7]

    Llm4vv: Developing llm-driven testsuite for compiler validation,

    C. Munley, A. Jarmusch, and S. Chandrasekaran, “Llm4vv: Developing llm-driven testsuite for compiler validation,” Future Generation Computer Systems, vol. 160, pp. 1–13, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X24002449

  8. [8]

    Refactoring programs using large language models with few-shot examples,

    A. Shirafuji, Y. Oda, J. Suzuki, M. Morishita, and Y. Watanobe, “Refactoring programs using large language models with few-shot examples,” in 2023 30th Asia-Pacific Software Engineering Conference (APSEC), 2023, pp. 151–160

  9. [9]

    Llm-based multi-agent system for intelligent refactoring of haskell code,

    S. Siddeeq, M. Waseem, Z. Rasheed, M. M. Hasan, J. Rasku, M. Saari, H. Terho, K. Mäkelä, K.-K. Kemell, and P. Abrahamsson, “Llm-based multi-agent system for intelligent refactoring of haskell code,” in Product-Focused Software Process Improvement, G. Scanniello, V. Lenarduzzi, S. Romano, S. Vegas, and R. Francese, Eds. Cham: Springer Nature Switzerland,...

  10. [10]

    Llm-driven code refactoring: Opportunities and limitations,

    J. Cordeiro, S. Noei, and Y. Zou, “Llm-driven code refactoring: Opportunities and limitations,” in 2025 IEEE/ACM Second IDE Workshop (IDE), 2025, pp. 32–36

  11. [11]

    An empirical study on the potential of llms in automated software refactoring,

    B. Liu, Y. Jiang, Y. Zhang, N. Niu, G. Li, and H. Liu, “An empirical study on the potential of llms in automated software refactoring,”

  12. [12]

    Available: https://arxiv.org/abs/2411.04444

    [Online]. Available: https://arxiv.org/abs/2411.04444

  13. [13]

    Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,

    A. Bagheri and P. Hegedüs, “Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,” in Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 117–121

  14. [14]

    An exploratory study of performance regression introducing code changes,

    J. Chen and W. Shang, “An exploratory study of performance regression introducing code changes,” in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 341–352

  15. [15]

    Large language models are few-shot testers: Exploring llm-based general bug reproduction,

    S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring llm-based general bug reproduction,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 2312–2323

  16. [16]

    An empirical study on the code refactoring capability of large language models,

    J. Cordeiro, S. Noei, and Y. Zou, “An empirical study on the code refactoring capability of large language models,” ACM Transactions on Software Engineering and Methodology, 2026

  17. [17]

    Ai-driven automatic code refactoring for performance optimization,

    O. R. Polu, “Ai-driven automatic code refactoring for performance optimization,” International Journal of Science and Research, vol. 14, no. 1, pp. 1316–1320, 2025

  18. [18]

    Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,

    Y. Xu, F. Lin, J. Yang, T.-H. Chen, and N. Tsantalis, “Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,” 2025. [Online]. Available: https://arxiv.org/abs/2503.14340

  19. [19]

    A multi-agent llm environment for software design and refactoring: A conceptual framework,

    V. Rajendran, D. Besiahgari, S. C. Patil, M. Chandrashekaraiah, and V. Challagulla, “A multi-agent llm environment for software design and refactoring: A conceptual framework,” in SoutheastCon 2025, 2025, pp. 488–493

  20. [20]

    Optimizing ai-assisted code generation,

    S. Torka and S. Albayrak, “Optimizing ai-assisted code generation,”

  21. [21]

    Available: https://arxiv.org/abs/2412.10953

    [Online]. Available: https://arxiv.org/abs/2412.10953

  22. [22]

    Assessing the effectiveness and security implications of ai code generators,

    M. Taeb, H. Chi, and S. Bernadin, “Assessing the effectiveness and security implications of ai code generators,” Journal of The Colloquium for Information Systems Security Education, vol. 11, p. 6, Feb. 2024

  23. [23]

    Claude code: Best practices for agentic coding,

    Anthropic, “Claude code: Best practices for agentic coding,” https://www.anthropic.com/engineering/claude-code-best-practices, Apr. 2025, accessed: 2026-01-26

  24. [24]

    AGENTS.md: A simple, open format for guiding coding agents,

    AGENTS.md Project, “AGENTS.md: A simple, open format for guiding coding agents,” https://agents.md/, 2026, accessed: 2026-01-26

  25. [25]

    Self-planning code generation with large language models,

    X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–30, 2024

  26. [26]

    Empowering ai to generate better ai code: Guided generation of deep learning projects with llms,

    C. Xie, M. Jiao, X. Gu, and B. Shen, “Empowering ai to generate better ai code: Guided generation of deep learning projects with llms,” in 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), 2025, pp. 1394–1399

  27. [27]

    R. K. Yin, Case study research and applications: Design and methods, 6th ed. Sage Publications, 2018

  28. [28]

    Guidelines for conducting and reporting case study research in software engineering,

    P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, pp. 131–164, 2009

  29. [29]

    Impossibility results in AI: A survey,

    M. Brcic and R. V. Yampolskiy, “Impossibility results in AI: A survey,” ACM Computing Surveys, vol. 55, no. 12, 2023

  30. [30]

    Emotion concepts and their function in a large language model,

    N. Sofroniew, I. Kauvar, W. Saunders, R. Chen, T. Henighan, S. Hydrie, C. Citro, A. Pearce, J. Tarng, W. Gurnee, J. Batson, S. Zimmerman, K. Rivoire, K. Fish, C. Olah, and J. Lindsey, “Emotion concepts and their function in a large language model,” Transformer Circuits Thread, Anthropic, April 2026, published April 2, 2026. [Online]. Available: https://tr...