pith. machine review for the scientific record.

arxiv: 2604.03135 · v1 · submitted 2026-04-03 · 💻 cs.SE · cs.AI

Recognition: no theorem link

AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

Ema Smolic, Luka Hobor, Mario Brcic, Mihael Kovac

Authors on Pith no claims yet

Pith reviewed 2026-05-13 18:45 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-assisted testing · unit test generation · test-driven refactoring · code refactoring · case study · legacy code · regression risk

The pith

AI coding models generate nearly 16,000 lines of unit tests in hours to enable safe refactoring of legacy code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a case study in which coding models first generate unit tests that capture the behavior of an existing MVP-derived software system. These tests then serve as a validation layer for model-proposed refactoring changes performed under developer oversight. The workflow produced a large test suite rapidly, reached substantial branch coverage in key modules, and lowered the chance of regressions during major code modifications. The study also notes observed model errors, the need for manual corrections at times, and steps taken to address value misalignment between model outputs and intended system goals. This illustrates a shift toward empirical, data-constrained software engineering practices.
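To make the workflow concrete, here is a minimal sketch of what such a behavior-capturing (characterization) test might look like, assuming a Python/pytest codebase purely for illustration; the module `legacy_pricing`, the function `apply_discount`, and the expected values are hypothetical, not taken from the paper.

```python
# Characterization-test sketch: pin down the *current* behavior of a legacy
# function so that later refactoring can be checked against it.
# The module, function, and expected values below are hypothetical.
import pytest

from legacy_pricing import apply_discount  # hypothetical legacy module


@pytest.mark.parametrize(
    "price, tier, expected",
    [
        (100.0, "gold", 85.0),    # outputs observed from the current implementation,
        (100.0, "silver", 92.5),  # recorded before any refactoring begins
        (0.0, "gold", 0.0),
    ],
)
def test_apply_discount_matches_current_behavior(price, tier, expected):
    # These assertions encode observed behavior, not a specification:
    # if the original code has a bug, the test preserves that bug too.
    assert apply_discount(price, tier) == pytest.approx(expected)


def test_unknown_tier_raises():
    # Edge-case behavior captured as-is from the existing implementation.
    with pytest.raises(KeyError):
        apply_discount(50.0, "platinum")
```

A suite of such tests acts as the validation layer: any refactoring that changes what the recorded calls return makes at least one test fail.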

Core claim

Iterative AI unit test generation can capture existing system behavior well enough to constrain and validate subsequent model-assisted refactoring, yielding nearly 16,000 lines of tests in hours rather than weeks and up to 78% branch coverage in critical modules while reducing regression risk.

What carries the argument

Iteratively generated AI unit tests that validate and constrain supervised model-assisted refactoring changes.

Load-bearing premise

AI-generated tests capture enough of the system's intended behavior that passing them reliably confirms refactoring safety without missing important bugs.

What would settle it

A refactoring that passes all generated AI tests yet produces incorrect behavior in actual usage or additional manual tests.
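Short of that, one can probe for such blind spots with a fault-injection (mutation-style) check: deliberately break behavior and confirm that the generated suite notices. A hedged sketch follows, reusing the hypothetical `legacy_pricing` module from the example above; dedicated mutation-testing tools such as mutmut automate this across many injected faults.

```python
# Fault-injection probe sketch: swap in a deliberately wrong implementation and
# check that the generated test suite fails. If the suite stays green, it has a
# blind spot of exactly the kind described above. Names are hypothetical.
import pytest

import legacy_pricing  # hypothetical module under test


def broken_apply_discount(price, tier):
    # Injected fault: the customer tier is ignored entirely.
    return price * 0.85


def suite_detects_injected_fault() -> bool:
    original = legacy_pricing.apply_discount
    legacy_pricing.apply_discount = broken_apply_discount  # inject the fault
    try:
        outcome = pytest.main(["-q", "tests/generated/test_legacy_pricing.py"])
    finally:
        legacy_pricing.apply_discount = original  # restore the real implementation
    return int(outcome) != 0  # non-zero exit code means the fault was caught


if __name__ == "__main__":
    if not suite_detects_injected_fault():
        raise SystemExit("Injected fault went undetected: suite has a coverage gap.")
```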

Figures

Figures reproduced from arXiv: 2604.03135 by Ema Smolic, Luka Hobor, Mario Brcic, Mihael Kovac.

Figure 1
Figure 1: AI-led test generation flow. The plan lived in a markdown file and contained information on 1) which parts of the codebase should be tested in the next step, 2) how specific test suites should be structured, and 3) which conventions to follow (alongside the rules in context files). It was updated iteratively after the previous step was implemented and reviewed by a human reviewer. … view at source ↗
Figure 3
Figure 3: LOC difference pre- and post-refactor. … view at source ↗
read the original abstract

Many software systems originate as prototypes or minimum viable products (MVPs), developed with an emphasis on delivery speed and responsiveness to changing requirements rather than long-term code maintainability. While effective for rapid delivery, this approach can result in codebases that are difficult to modify, presenting a significant opportunity cost in the era of AI-assisted or even AI-led programming. In this paper, we present a case study of using coding models for automated unit test generation and subsequent safe refactoring, with proposed code changes validated by passing tests. The study examines best practices for iteratively generating tests to capture existing system behavior, followed by model-assisted refactoring under developer supervision. We describe how this workflow constrained refactoring changes, the errors and limitations observed in both phases, the efficiency gains achieved, when manual intervention was necessary, and how we addressed the weak value misalignment we observed in models. Using this approach, we generated nearly 16,000 lines of reliable unit tests in hours rather than weeks, achieved up to 78% branch coverage in critical modules, and significantly reduced regression risk during large-scale refactoring. These results illustrate software engineering's shift toward an empirical science, emphasizing data collection and constraining mechanisms that support fast, safe iteration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a single-case study of using AI coding models to iteratively generate unit tests for a legacy codebase (nearly 16,000 lines, up to 78% branch coverage in critical modules) and then apply test-driven refactoring under developer supervision, claiming substantial time savings (hours vs. weeks) and reduced regression risk.

Significance. If the central safety claim holds, the work supplies a concrete, reproducible workflow for AI-assisted maintenance of MVP-derived codebases and illustrates measurable efficiency gains that could accelerate empirical practices in software engineering.

major comments (2)
  1. [Abstract and Results] The headline claim that the generated tests are 'reliable' and 'significantly reduced regression risk' rests on the unverified premise that tests derived from the existing implementation capture intended behavior rather than current (possibly buggy) behavior. No fault-injection experiments, comparison to human-written test suites, or post-refactor manual audit are reported, so the safety assertion cannot be separated from the completeness assumption.
  2. [Methodology] The workflow begins by prompting on the current code rather than an external specification or requirements document. This creates a circularity risk that any gaps or defects in the original implementation are mirrored in the test suite; the paper does not quantify or mitigate this beyond general developer supervision.
minor comments (1)
  1. [Abstract] The term 'weak value misalignment' is used in the abstract without a precise definition or example in the provided summary; a short clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our case study. We address each major comment below with the strongest honest defense possible, proposing targeted revisions to clarify claims and limitations without overstating the results.

read point-by-point responses
  1. Referee: [Abstract and Results] The headline claim that the generated tests are 'reliable' and 'significantly reduced regression risk' rests on the unverified premise that tests derived from the existing implementation capture intended behavior rather than current (possibly buggy) behavior. No fault-injection experiments, comparison to human-written test suites, or post-refactor manual audit are reported, so the safety assertion cannot be separated from the completeness assumption.

    Authors: We agree the distinction is important and that our original wording could be read as claiming verification of intended behavior. The case study explicitly targets capture of observed behavior in a legacy MVP codebase to support regression-safe refactoring; the tests are reliable insofar as they pass on the current implementation and flag behavioral changes during edits. However, absent fault injection, human-written test comparisons, or post-refactor audits, we cannot claim the suite would have caught pre-existing defects. We will revise the abstract and results to replace 'reliable unit tests' and 'significantly reduced regression risk' with phrasing that specifies 'tests capturing observed behavior' and 'reduced risk of unintended behavioral changes during refactoring,' and we will add an explicit limitations paragraph on this point. revision: partial

  2. Referee: [Methodology] The workflow begins by prompting on the current code rather than an external specification or requirements document. This creates a circularity risk that any gaps or defects in the original implementation are mirrored in the test suite; the paper does not quantify or mitigate this beyond general developer supervision.

    Authors: This accurately describes the practical constraint of the setting: the codebase originated as an MVP with no maintained external specification. Prompting from current code is the only feasible entry point for reverse-engineering behavior in such legacy systems. Mitigation occurs via iterative developer review of generated tests before they are used to gate refactoring changes. While we cannot quantify the residual risk without unavailable ground-truth specifications, we will expand the methodology section to detail the specific supervision steps (e.g., manual inspection of tests for core modules, rejection of tests that fail to exercise key paths) and add a dedicated paragraph acknowledging the circularity risk as an inherent limitation of implementation-driven test generation for legacy code. revision: partial
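To make the proposed supervision steps concrete, here is a hedged sketch of one possible gate in a Python/pytest setting: a candidate AI-generated test module is accepted only if it passes and actually exercises a designated set of key functions. The module, function names, and paths are illustrative and are not the authors' tooling.

```python
# Supervision-gate sketch: accept an AI-generated test module only if it passes
# and exercises designated key paths. Module, function names, and paths are
# illustrative; this is not the tooling described in the paper.
import functools

import pytest

import legacy_pricing  # hypothetical module being refactored

KEY_FUNCTIONS = ["apply_discount", "close_invoice"]  # hypothetical critical paths


def gate_generated_tests(test_file: str) -> bool:
    exercised: set = set()

    def recording(name, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            exercised.add(name)
            return fn(*args, **kwargs)
        return wrapper

    originals = {name: getattr(legacy_pricing, name) for name in KEY_FUNCTIONS}
    for name, fn in originals.items():
        setattr(legacy_pricing, name, recording(name, fn))  # wrap to record calls
    try:
        passed = int(pytest.main(["-q", test_file])) == 0
    finally:
        for name, fn in originals.items():
            setattr(legacy_pricing, name, fn)  # restore originals

    return passed and exercised.issuperset(KEY_FUNCTIONS)


if __name__ == "__main__":
    if not gate_generated_tests("tests/generated/test_legacy_pricing.py"):
        raise SystemExit("Rejected: candidate tests fail or miss key paths.")
```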

Circularity Check

0 steps flagged

No circularity: empirical case study with direct observations only

full rationale

The paper is a descriptive case study of an AI-assisted workflow for unit test generation and refactoring. It reports concrete empirical outcomes (16,000 lines of tests generated, up to 78% branch coverage) without any equations, fitted parameters, predictions, or derivations. No self-citations are used as load-bearing premises, no ansatzes are smuggled, and no results are renamed or defined in terms of themselves. All claims reduce to direct observation of the single case rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of current coding models at producing behavior-capturing tests and on the standard assumption that passing tests validate behavioral equivalence during refactoring.

axioms (2)
  • domain assumption AI coding models can iteratively produce tests that capture existing system behavior when guided by developer feedback
    Invoked as the basis for the success of the test-generation phase
  • standard math Tests that pass after a code change confirm that the change preserves original behavior
    Standard test-driven development premise used to validate refactoring
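The second assumption can be made concrete as a golden-master check: record the legacy implementation's outputs over a fixed input grid before refactoring, then require the refactored code to reproduce them. A minimal sketch, again using the hypothetical `legacy_pricing` module; note that equivalence is only confirmed on the sampled inputs, which is exactly the limit the referee raises.

```python
# Golden-master sketch of the "passing tests confirm preserved behavior" premise:
# snapshot outputs of the current implementation, then compare after refactoring.
# Equivalence is only checked on the sampled inputs; names are hypothetical.
import itertools
import json

from legacy_pricing import apply_discount  # hypothetical function under refactor

INPUT_GRID = list(itertools.product([0.0, 10.0, 99.99, 100.0], ["gold", "silver"]))


def snapshot(path: str = "golden_outputs.json") -> None:
    # Run once on the pre-refactor code to record observed behavior.
    golden = {f"{price}|{tier}": apply_discount(price, tier) for price, tier in INPUT_GRID}
    with open(path, "w") as fh:
        json.dump(golden, fh, indent=2)


def matches_snapshot(path: str = "golden_outputs.json") -> bool:
    # Run after refactoring: every sampled call must reproduce the recorded value.
    with open(path) as fh:
        golden = json.load(fh)
    return all(
        apply_discount(price, tier) == golden[f"{price}|{tier}"]
        for price, tier in INPUT_GRID
    )
```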

pith-pipeline@v0.9.0 · 5519 in / 1314 out tokens · 29629 ms · 2026-05-13T18:45:58.562663+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Ai-assisted code generation and optimization in .net web development,

    A. S. Shethiya, “Ai-assisted code generation and optimization in .net web development,” Annals of Applied Sciences, vol. 6, no. 1, Jan. 2025. [Online]. Available: https://annalsofappliedsciences.com/index.php/aas/article/view/15

  2. [2]

    Fowler, Refactoring: improving the design of existing code

    M. Fowler, Refactoring: improving the design of existing code. Addison-Wesley Professional, 2018

  3. [3]

    An empirical evaluation of using large language models for automated unit test generation,

    M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Transactions on Software Engineering, vol. 50, no. 1, pp. 85–105, 2024

  4. [4]

    Test Intention Guided LLM-Based Unit Test Generation,

    Z. Nan, Z. Guo, K. Liu, and X. Xia, “Test Intention Guided LLM-Based Unit Test Generation,” in 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). Los Alamitos, CA, USA: IEEE Computer Society, May 2025, pp. 1026–1038. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICSE55347.2025.00243

  5. [5]

    Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,

    G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, “Code-aware prompting: A study of coverage-guided test generation in regression setting using llm,” Proc. ACM Softw. Eng., vol. 1, no. FSE, Jul. 2024. [Online]. Available: https://doi.org/10.1145/3643769

  6. [6]

    Chatunitest: A framework for llm-based test generation,

    Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, “Chatunitest: A framework for llm-based test generation,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, ser. FSE 2024. New York, NY, USA: Association for Computing Machinery, 2024, pp. 572–576. [Online]. Available: https://doi.org/10.1145/3...

  7. [7]

    Llm4vv: Developing llm-driven testsuite for compiler validation,

    C. Munley, A. Jarmusch, and S. Chandrasekaran, “Llm4vv: Developing llm-driven testsuite for compiler validation,” Future Generation Computer Systems, vol. 160, pp. 1–13, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X24002449

  8. [8]

    Refactoring programs using large language models with few-shot examples,

    A. Shirafuji, Y. Oda, J. Suzuki, M. Morishita, and Y. Watanobe, “Refactoring programs using large language models with few-shot examples,” in 2023 30th Asia-Pacific Software Engineering Conference (APSEC), 2023, pp. 151–160

  9. [9]

    Llm-based multi-agent system for intelligent refactoring of haskell code,

    S. Siddeeq, M. Waseem, Z. Rasheed, M. M. Hasan, J. Rasku, M. Saari, H. Terho, K. Mäkelä, K.-K. Kemell, and P. Abrahamsson, “Llm-based multi-agent system for intelligent refactoring of haskell code,” in Product-Focused Software Process Improvement, G. Scanniello, V. Lenarduzzi, S. Romano, S. Vegas, and R. Francese, Eds. Cham: Springer Nature Switzerland,...

  10. [10]

    Llm-driven code refactoring: Opportunities and limitations,

    J. Cordeiro, S. Noei, and Y. Zou, “Llm-driven code refactoring: Opportunities and limitations,” in 2025 IEEE/ACM Second IDE Workshop (IDE), 2025, pp. 32–36

  11. [11]

    An empirical study on the potential of llms in automated software refactoring,

    B. Liu, Y. Jiang, Y. Zhang, N. Niu, G. Li, and H. Liu, “An empirical study on the potential of llms in automated software refactoring,”

  12. [12]

    Available: https://arxiv.org/abs/2411.04444

    [Online]. Available: https://arxiv.org/abs/2411.04444

  13. [13]

    Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,

    A. Bagheri and P. Hegedüs, “Is refactoring always a good egg? exploring the interconnection between bugs and refactorings,” in Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 117–121

  14. [14]

    An exploratory study of performance regression introducing code changes,

    J. Chen and W. Shang, “An exploratory study of performance regression introducing code changes,” in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 341–352

  15. [15]

    Large language models are few-shot testers: Exploring llm-based general bug reproduction,

    S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring llm-based general bug reproduction,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 2312–2323

  16. [16]

    An empirical study on the code refactoring capability of large language models,

    J. Cordeiro, S. Noei, and Y. Zou, “An empirical study on the code refactoring capability of large language models,” ACM Transactions on Software Engineering and Methodology, 2026

  17. [17]

    Ai-driven automatic code refactoring for performance optimization,

    O. R. Polu, “Ai-driven automatic code refactoring for performance optimization,” International Journal of Science and Research, vol. 14, no. 1, pp. 1316–1320, 2025

  18. [18]

    Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,

    Y. Xu, F. Lin, J. Yang, T.-H. Chen, and N. Tsantalis, “Mantra: Enhancing automated method-level refactoring with contextual rag and multi-agent llm collaboration,” 2025. [Online]. Available: https://arxiv.org/abs/2503.14340

  19. [19]

    A multi-agent llm environment for software design and refactoring: A conceptual framework,

    V. Rajendran, D. Besiahgari, S. C. Patil, M. Chandrashekaraiah, and V. Challagulla, “A multi-agent llm environment for software design and refactoring: A conceptual framework,” in SoutheastCon 2025, 2025, pp. 488–493

  20. [20]

    Optimizing ai-assisted code generation,

    S. Torka and S. Albayrak, “Optimizing ai-assisted code generation,”

  21. [21]

    Available: https://arxiv.org/abs/2412.10953

    [Online]. Available: https://arxiv.org/abs/2412.10953

  22. [22]

    Assessing the effectiveness and security implications of ai code generators,

    M. Taeb, H. Chi, and S. Bernadin, “Assessing the effectiveness and security implications of ai code generators,” Journal of The Colloquium for Information Systems Security Education, vol. 11, p. 6, Feb. 2024

  23. [23]

    Claude code: Best practices for agentic coding,

    Anthropic, “Claude code: Best practices for agentic coding,” https://www.anthropic.com/engineering/claude-code-best-practices, Apr. 2025, accessed: 2026-01-26

  24. [24]

    AGENTS.md: A simple, open format for guiding coding agents,

    AGENTS.md Project, “AGENTS.md: A simple, open format for guiding coding agents,” https://agents.md/, 2026, accessed: 2026-01-26

  25. [25]

    Self-planning code generation with large language models,

    X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–30, 2024

  26. [26]

    Empowering ai to generate better ai code: Guided generation of deep learning projects with llms,

    C. Xie, M. Jiao, X. Gu, and B. Shen, “Empowering ai to generate better ai code: Guided generation of deep learning projects with llms,” in 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), 2025, pp. 1394–1399

  27. [27]

    R. K. Yin, Case study research and applications: Design and methods, 6th ed. Sage Publications, 2018

  28. [28]

    Guidelines for conducting and reporting case study research in software engineering,

    P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, no. 2, pp. 131–164, 2009

  29. [29]

    Impossibility results in AI: A survey,

    M. Brcic and R. V. Yampolskiy, “Impossibility results in AI: A survey,” ACM Computing Surveys, vol. 55, no. 12, 2023

  30. [30]

    Emotion concepts and their function in a large language model,

    N. Sofroniew, I. Kauvar, W. Saunders, R. Chen, T. Henighan, S. Hydrie, C. Citro, A. Pearce, J. Tarng, W. Gurnee, J. Batson, S. Zimmerman, K. Rivoire, K. Fish, C. Olah, and J. Lindsey, “Emotion concepts and their function in a large language model,” Transformer Circuits Thread, Anthropic, April 2026, published April 2, 2026. [Online]. Available: https://tr...