Humanizing Automatically Generated Unit Test Suites with LLM-Based Refactoring

Anil Koyuncu; David Lo; Jacques Klein; Lingfeng Bao; Pawe{\l} Borsukiewicz; Tegawend\'e F. Bissyand\'e; Wendk\^uuni C. Ou\'edraogo; Xueqi Dang; Yinghua Li

arxiv: 2606.28229 · v1 · pith:JYIBDQGQnew · submitted 2026-06-26 · 💻 cs.SE

Humanizing Automatically Generated Unit Test Suites with LLM-Based Refactoring

Wendk\^uuni C. Ou\'edraogo , Yinghua Li , Xueqi Dang , Pawe{\l} Borsukiewicz , Lingfeng Bao , Anil Koyuncu , Jacques Klein , David Lo

show 1 more author

Tegawend\'e F. Bissyand\'e

This is my paper

Pith reviewed 2026-06-29 02:57 UTC · model grok-4.3

classification 💻 cs.SE

keywords unit test generationLLM refactoringsearch-based software testingtest readabilityEvoSuitehybrid SBST+LLMtest suite maintenance

0 comments

The pith

LLM refactoring of EvoSuite suites yields 88-98% compilation while preserving coverage and boosting readability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Search-based test generators produce compilable high-coverage suites that are often difficult for developers to read or maintain. The paper presents TestHumanizer, which applies LLMs to refactor those suites into clearer forms using controlled prompts. Evaluation across 350 classes shows the refactored suites compile at 88-98 percent, stay within 1-2 points of original coverage, and meet faithful-refactoring criteria in 86-95 percent of cases. A developer study on 30 classes finds significant improvements in perceived readability and willingness to adopt the suites.

Core claim

TestHumanizer reaches 88-98% compilation rates on refactored suites, close to EvoSuite's 100% baseline, with structural coverage preserved within 1-2 percentage points and 86-95% of refactorings satisfying a composite faithful-refactoring threshold, while also improving predicted readability and reducing complexity.

What carries the argument

TestHumanizer, a hybrid pipeline that applies LLMs as validation-gated refactoring layers over SBST suites under summary-based, code-centric, and other context configurations.

If this is right

Structural coverage remains within 1-2 percentage points of the original EvoSuite suites.
86-95% of the produced refactorings satisfy the composite faithful-refactoring threshold.
Refactored suites reduce control-flow complexity, cognitive complexity, and structural smells.
The summary-based context configuration delivers the most robust trade-off between success rate and quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standalone LLM test generation would likely need post-generation validation steps to approach the reliability of the hybrid approach.
The same refinement layer could be applied to outputs from other search-based or random test generators.
The results point to a general pattern in which generative models function best when gated by outputs from established automated methods.

Load-bearing premise

LLM refactoring preserves the original test semantics and intent even when only checked via compilation success and line or branch coverage.

What would settle it

A refactored test that compiles and reports matching coverage yet fails on an input where the original test passes, or passes where the original fails.

Figures

Figures reproduced from arXiv: 2606.28229 by Anil Koyuncu, David Lo, Jacques Klein, Lingfeng Bao, Pawe{\l} Borsukiewicz, Tegawend\'e F. Bissyand\'e, Wendk\^uuni C. Ou\'edraogo, Xueqi Dang, Yinghua Li.

**Figure 1.** Figure 1: Overview of the general workflow of TestHumanizer. form a multi-layered validation pipeline that balances developer-oriented improvements with behavioral faithfulness. 3.2 Research Questions RQ1 (Feasibility and robustness): How feasible and robust is TestHumanizer in producing syntactically correct and compilable refactored unit test suites? This research question assesses the viability of TestHumanizer a… view at source ↗

**Figure 2.** Figure 2: reports their occurrence counts across the three context configurations. Cannot Find Symbol Incompatible Types Package Does Not Exist 0 500 1000 1500 2000 2500 Number of compilation error occurrences 2,395 299 183 2,428 308 183 2,287 294 183 (a) Defects4J Cannot Find Symbol Incompatible Types 0 20 40 60 80 100 120 140 98 81 118 52 61 25 Package Does Not Exist not observed in SF110 (b) SF110 Configuration T… view at source ↗

**Figure 3.** Figure 3: Readability with Scalabrino. EvoSuite gpt-4o Tests-onlyCode-centric Summary-based 0 20 40 60 80 100 120 Approach Cyclomatic (a) Defects4J EvoSuite gpt-4o Tests-onlyCode-centric Summary-based 0 50 100 150 200 Approach Cyclomatic (b) SF110 [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Cyclomatic Complexity. All three TestHumanizer configurations with gpt-4o outperform both EvoSuite and direct LLM-generated tests on predicted readability. On Defects4J, the mean Scalabrino score increases from 0.56 (EvoSuite) to 0.67 (direct gpt-4o), and further to 0.74–0.75 for TestHumanizer, with the summary-based variant achieving the highest mean (0.75). A consistent pattern holds on SF110: EvoSuite a… view at source ↗

**Figure 5.** Figure 5: Test-Aware Cognitive Complexity. 19.8 and CCTR ≈ 34.2 vs. 39.5 on Defects4J; CC ≈ 19.8 vs. 25.6 and CCTR ≈ 46.7 vs. 58.6 on SF110), while remaining more structurally rich than direct LLM outputs. Among the three configurations, summary-based yields the lowest complexity on both datasets. Taken together, these results show that TestHumanizer improves predicted readability while reducing the cognitive load o… view at source ↗

**Figure 6.** Figure 6: shows the distribution of instruction coverage. EvoSuite provides a strong SBST baseline, with tightly clustered high coverage on both Defects4J and SF110. Direct gpt-4o suites exhibit substantially lower and more dispersed coverage, with medians around 80% on Defects4J and close to 60% on SF110. In contrast, all three TestHumanizer configurations closely track EvoSuite, with the summary-based variant show… view at source ↗

**Figure 7.** Figure 7: Line coverage [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: reports method coverage. EvoSuite achieves near-perfect coverage on both datasets (medians at 100%). Direct gpt-4o tests cover far fewer methods, particularly on SF110 where the distribution centers well below 50%. TestHumanizer almost entirely preserves EvoSuite’s method coverage: all three configurations maintain medians at 100% on both datasets, with mean method coverage above 99.4% on Defects4J and 99.… view at source ↗

**Figure 9.** Figure 9: Distributions of readability scores from Human evaluation [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Distributions of adoption scores from the human evaluation. Same layout as [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

read the original abstract

Search-based test generation tools such as EvoSuite produce compilable and high-coverage unit tests at scale, but their suites are often hard to read and maintain. LLMs can generate more natural tests, yet direct generation remains brittle, with compilation rates of only 51-78% in our study. We introduce TestHumanizer, a hybrid SBST+LLM approach that uses LLMs as controlled refactoring layers over compilable SBST suites to improve naming, structure, and developer-oriented clarity while preserving behavior and compilation validity. We evaluate TestHumanizer on 350 classes from Defects4J and SF110. EvoSuite generates 15 suites per class, and each suite is refactored under three context configurations using gpt-4o and mistral-large-2407, yielding 31,500 refactorings. TestHumanizer reaches 88-98% compilation rates, close to EvoSuite's 100% baseline and clearly above direct LLM generation. Structural coverage is largely preserved, typically within 1-2 percentage points, and 86-95% of refactorings satisfy a composite faithful-refactoring threshold. Refactored suites also improve predicted readability, reduce control-flow and cognitive complexity, and mitigate structural smells. The summary-based setting offers the most robust trade-off, while long code-centric prompts are more prone to hallucination-induced failures. A developer study on 30 classes and 444 test methods confirms significant gains in perceived readability and willingness to adopt, with Wilcoxon p less than 0.01 and substantial inter-rater agreement. Overall, LLMs are most effective not as standalone generators but as validation-gated refinement layers over robust SBST outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TestHumanizer shows LLMs can refactor EvoSuite suites for better readability at scale, but the 86-95% faithful claim rests only on compilation plus coverage.

read the letter

The main takeaway is that using LLMs as a controlled refactoring step on top of working SBST suites produces more readable tests than asking LLMs to generate from scratch. Compilation stays high at 88-98%, coverage drops are small, and a developer study finds people prefer the results.

The evaluation is the strongest part. They processed 350 classes across Defects4J and SF110, generated 15 suites per class, applied three different context setups with two models, and ended up with 31,500 refactorings. Adding the developer study on 30 classes and 444 methods with Wilcoxon tests gives it more weight than most pure generation papers. The finding that summary-based prompts work better than long code-centric ones is practical and matches what people see with hallucination.

The soft spot is the definition of faithful refactoring. It combines compilation success with structural coverage staying within 1-2 points. That setup can miss cases where an assertion is weakened, a precondition changes, or exception behavior shifts while the same lines still execute. The abstract does not mention differential testing, assertion equivalence checks, or mutation analysis, so the 86-95% figure is harder to trust on its own. The paper already notes hallucination risks with longer prompts, which makes the gap more noticeable.

This is for researchers and tool builders working on automated Java unit testing. Anyone comparing SBST and LLM approaches will get usable numbers on context choices and human preference. The scale and study are enough to justify sending it to referees, even if the semantic preservation argument needs tightening in review.

Referee Report

2 major / 2 minor

Summary. The paper introduces TestHumanizer, a hybrid SBST+LLM refactoring approach that applies LLMs (gpt-4o, mistral-large-2407) to EvoSuite-generated test suites on 350 classes from Defects4J and SF110. It generates 31,500 refactorings across three context configurations, reporting 88-98% compilation rates (vs. 51-78% for direct LLM generation), structural coverage preservation within 1-2pp, 86-95% of refactorings meeting a composite faithful-refactoring threshold, reductions in complexity/smells, and a developer study (30 classes, 444 methods) showing significant readability gains (Wilcoxon p<0.01).

Significance. If the preservation claims hold, the work demonstrates a scalable, validation-gated use of LLMs as refinement layers over robust SBST outputs rather than standalone generators. The large-scale empirical results (31,500 refactorings) plus controlled developer study with statistical tests provide concrete evidence for improved readability and adoptability of automated tests, with the summary-based configuration identified as the best trade-off.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: The headline 86-95% 'faithful-refactoring' rate rests on a composite threshold (compilation success + structural coverage within 1-2pp). This threshold is load-bearing for the central preservation claim yet appears insufficient to rule out semantic alterations to assertions, preconditions, or exception expectations, which would still satisfy line/branch coverage. No differential execution, assertion-equivalence checks, or mutation-score comparison is described.
[Developer study] Developer study (30 classes): While Wilcoxon tests show p<0.01 for readability, the study design does not report whether participants were shown the original EvoSuite suite alongside the refactored version or whether they executed the tests; this limits the strength of the 'willingness to adopt' conclusion relative to the preservation claims.

minor comments (2)

[Method] The three context configurations (summary-based, code-centric, long code-centric) are referenced but their exact prompt templates and token budgets should be provided in an appendix or table for reproducibility.
[Results] Table or figure reporting per-configuration hallucination rates and failure modes would clarify why long prompts are 'more prone to hallucination-induced failures'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below, with revisions planned where the evaluation can be strengthened or clarified without misrepresenting the current results.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: The headline 86-95% 'faithful-refactoring' rate rests on a composite threshold (compilation success + structural coverage within 1-2pp). This threshold is load-bearing for the central preservation claim yet appears insufficient to rule out semantic alterations to assertions, preconditions, or exception expectations, which would still satisfy line/branch coverage. No differential execution, assertion-equivalence checks, or mutation-score comparison is described.

Authors: We acknowledge that the composite threshold (successful compilation plus structural coverage within 1-2pp) does not provide a complete guarantee against semantic changes to assertions, preconditions, or exception expectations. Coverage preservation serves as a practical proxy for behavioral similarity in a refactoring setting, but it is not equivalent to mutation analysis or differential execution. In the revised manuscript we will (1) explicitly state this limitation in the threats-to-validity section and (2) add a short discussion of why coverage was chosen as the primary metric given the scale of 31,500 refactorings. We will also report any post-hoc assertion-change statistics that can be computed from the existing data without new experiments. Full mutation-score comparisons would require substantial additional computation and are noted as future work. revision: partial
Referee: [Developer study] Developer study (30 classes): While Wilcoxon tests show p<0.01 for readability, the study design does not report whether participants were shown the original EvoSuite suite alongside the refactored version or whether they executed the tests; this limits the strength of the 'willingness to adopt' conclusion relative to the preservation claims.

Authors: The developer study collected ratings on readability and willingness to adopt for the refactored suites. The manuscript does not currently detail the exact presentation protocol or whether participants executed the tests. We will revise the study-design subsection to clarify that (a) participants evaluated each refactored suite after being told it was derived from an EvoSuite baseline (but were not shown the original side-by-side for every rating), and (b) the study was a perception survey and did not require test execution. These clarifications will be added to the revised version so that the strength of the adoption claims is accurately represented. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmarks

full rationale

The paper is a purely empirical study that applies LLMs as a post-processing layer to EvoSuite-generated test suites, then measures compilation success, structural coverage preservation, readability metrics, and developer perception on fixed external benchmarks (Defects4J, SF110). No equations, fitted parameters, or self-defined quantities appear; the reported percentages (88-98% compilation, 86-95% faithful refactorings) are direct counts of outcomes against independently defined criteria rather than quantities that reduce to the inputs by construction. The composite threshold is a reporting filter, not a self-referential derivation. No load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLM refactoring can be constrained to preserve test behavior when guided by compilation and coverage signals.

axioms (1)

domain assumption LLM-based refactoring preserves test behavior when compilation and structural coverage remain within small thresholds
This assumption underpins the claim that the refactored suites remain faithful without full semantic equivalence verification.

pith-pipeline@v0.9.1-grok · 5880 in / 1179 out tokens · 59498 ms · 2026-06-29T02:57:55.194335+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Toufique Ahmed and Premkumar Devanbu. 2022. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–5

2022
[2]

Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl T Barr. 2024. Automatic semantic augmentation of language model prompts (for code summarization). In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE).IEEE Computer Society(2024), 1004–1004

2024
[3]

M Moein Almasi, Hadi Hemmati, Gordon Fraser, Andrea Arcuri, and Janis Benefelds. 2017. An industrial evaluation of unit test generation: Finding real faults in a financial application. In2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). IEEE, 263–272

2017
[4]

Andrea Arcuri and Gordon Fraser. 2013. Parameter tuning or default values? An empirical investigation in search-based software engineering.Empirical Software Engineering18 (2013), 594–623

2013
[5]

Alberto Bacchelli, Paolo Ciancarini, and Davide Rossi. 2008. On the effectiveness of manual and automatic unit test generation. In2008 The Third International Conference on Software Engineering Advances. IEEE, 252–257

2008
[6]

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine J. ACM, Vol. 1, No. 1, Article . Publication date: June 2026. Humanizing Automatically Generated Unit Test Suites with LLM-Based ...

2005
[7]

Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and David Binkley. 2012. An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In2012 28th IEEE international conference on software maintenance (ICSM). IEEE, 56–65

2012
[8]

Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. 2015. Are test smells really harmful? an empirical study.Empirical Software Engineering20 (2015), 1052–1094

2015
[9]

2000.Extreme programming explained: embrace change

Kent Beck. 2000.Extreme programming explained: embrace change. addison-wesley professional

2000
[10]

Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, and Pankaj Jalote. 2024. Unit test generation using generative ai: A comparative performance analysis of autogeneration tools. InProceedings of the 1st International Workshop on Large Language Models for Code. 54–61

2024
[11]

Matteo Biagiola, Gianluca Ghislotti, and Paolo Tonella. 2025. Improving the readability of automatically generated tests using large language models. In2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 162–173

2025
[12]

Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability.IEEE Transactions on software engineering36, 4 (2009), 546–558

2009
[13]

G Ann Campbell. 2018. Cognitive complexity: An overview and evaluation. InProceedings of the 2018 international conference on technical debt. 57–58

2018
[14]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

Pith/arXiv arXiv 2021
[15]

Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer. 2015. Modeling readability to improve unit tests. InProceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 107–118

2015
[16]

Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. 2024. Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests.arXiv preprint arXiv:2408.11710(2024)

arXiv 2024
[17]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al . 2020. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155(2020)

Pith/arXiv arXiv 2020
[18]

Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419

2011
[19]

Gordon Fraser and Andrea Arcuri. 2014. A large-scale evaluation of automated unit test generation using evosuite. ACM Transactions on Software Engineering and Methodology (TOSEM)24, 2 (2014), 1–42

2014
[20]

Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. 2015. Does automated unit test generation really help software testers? a controlled empirical study.ACM Transactions on Software Engineering and Methodology (TOSEM)24, 4 (2015), 1–49

2015
[21]

Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia. 2024. Context-Enhanced LLM-Based Framework for Automatic Test Refactoring.arXiv preprint arXiv:2409.16739(2024)

arXiv 2024
[22]

Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia. 2025. Automated Unit Test Refactoring.Proceedings of the ACM on Software Engineering2, FSE (2025), 713–733

2025
[23]

Gregory Gay. 2023. Improving the Readability of Generated Tests Using GPT-4 and ChatGPT Code Interpreter. In International Symposium on Search Based Software Engineering. Springer, 140–146

2023
[24]

Giovanni Grano, Simone Scalabrino, Harald C Gall, and Rocco Oliveto. 2018. An empirical investigation on the readability of manual and generated test cases. InProceedings of the 26th Conference on Program Comprehension. 348–351

2018
[25]

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366 (2020)

Pith/arXiv arXiv 2020
[26]

Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, et al. 2024. Found in the middle: Calibrating positional attention bias improves long context utilization. InFindings of the Association for Computational Linguistics: ACL 2024. 14982–14995

2024
[27]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In54th Annual Meeting of the Association for Computational Linguistics 2016. Association for Computational Linguistics, 2073–2083

2016
[28]

Gunel Jahangirova and Valerio Terragni. 2023. SBFT tool competition 2023-Java test case generation track. In2023 IEEE/ACM International Workshop on Search-Based and Fuzz Testing (SBFT). IEEE, 61–64

2023
[29]

René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. InProceedings of the 2014 international symposium on software testing and analysis. J. ACM, Vol. 1, No. 1, Article . Publication date: June 2026. 34 W. Ouédraogo et al. 437–440

2014
[30]

Adam Kalai, Ofir Nachum, Santosh Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. https: //doi.org/10.48550/arXiv.2509.04664

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.04664 2025
[31]

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931

2023
[32]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

2004
[33]

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. InProceedings of the 42nd annual meeting of the association for computational linguistics (ACL-04). 605–612

2004
[34]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics 12 (2024), 157–173

2024
[35]

Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for python. InProceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 168–172

2022
[36]

Antonio Mastropaolo, Matteo Ciniselli, Massimiliano Di Penta, and Gabriele Bavota. 2024. Evaluating Code Summariza- tion Techniques: A New Metric and an Empirical Characterization. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

2024
[37]

Thomas J McCabe. 1976. A complexity measure.IEEE Transactions on software Engineering4 (1976), 308–320

1976
[38]

Rian Melo, Pedro Simões, Rohit Gheyi, Marcelo d’Amorim, Márcio Ribeiro, Gustavo Soares, Eduardo Almeida, and Elvys Soares. 2025. Agentic LMs: Hunting Down Test Smells.IEEE Software(2025)

2025
[39]

Delano Oliveira, Reyde Bruno, Fernanda Madeiral, Hidehiko Masuhara, and Fernando Castor. 2022. A systematic literature review on the impact of formatting elements on program understandability.A vailable at SSRN 4182156 (2022)

2022
[40]

Wendkûuni C Ouédraogo, Kader Kaboré, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2024. Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation.arXiv preprint arXiv:2407.00225(2024)

arXiv 2024
[41]

Wendkuuni C Ouedraogo, Kader Kabore, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, and Tegawende F Bissyande. 2024. Llms and prompting for unit test generation: A large-scale evaluation. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 2464–2465

2024
[42]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Pawel Borsukiewicz, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Human-Aligned Code Readability Assessment with Large Language Models.arXiv preprint arXiv:2510.16579(2025)

arXiv 2025
[43]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Xunzhu Tang, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2024. Test smells in llm-generated unit tests.arXiv preprint arXiv:2410.10628(2024)

arXiv 2024
[44]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness.arXiv preprint arXiv:2506.06767(2025)

arXiv 2025
[45]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Rethinking Cognitive Complexity for Unit Tests: Toward a Readability-Aware Metric Grounded in Developer Perception. In2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 797–802

2025
[46]

Wendkûuni C Ouédraogo, Laura Plein, Kader Kabore, Andrew Habib, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Enriching automatic test case generation by extracting relevant test inputs from bug reports.Empirical Software Engineering30, 3 (2025), 85

2025
[47]

Fabio Palomba, Dario Di Nucci, Annibale Panichella, Rocco Oliveto, and Andrea De Lucia. 2016. On the diffusion of test smells in automatically generated test code: An empirical study. InProceedings of the 9th international workshop on search-based software testing. 5–14

2016
[48]

Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. 2025. Aster: Natural and multi- language unit test generation with llms. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 413–424

2025
[49]

Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2017. Automated test case generation as a many- objective optimisation problem with dynamic selection of the targets.IEEE Transactions on Software Engineering44, 2 (2017), 122–158

2017
[50]

Annibale Panichella, Sebastiano Panichella, Gordon Fraser, Anand Ashok Sawant, and Vincent J Hellendoorn. 2020. Revisiting test smells in automatically generated tests: limitations, pitfalls, and opportunities. In2020 IEEE international J. ACM, Vol. 1, No. 1, Article . Publication date: June 2026. Humanizing Automatically Generated Unit Test Suites with L...

2020
[51]

Anthony Peruma, Khalid Almalki, Christian D Newman, Mohamed Wiem Mkaouer, Ali Ouni, and Fabio Palomba
[52]

InProceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering

Tsdetect: An open source test smells detection tool. InProceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. 1650–1654
[53]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis.arXiv preprint arXiv:2009.10297 (2020)

Pith/arXiv arXiv 2020
[54]

Devjeet Roy, Ziyi Zhang, Maggie Ma, Venera Arnaoudova, Annibale Panichella, Sebastiano Panichella, Danielle Gonzalez, and Mehdi Mirakhorli. 2020. DeepTC-Enhancer: Improving the readability of automatically generated tests. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 287–298

2020
[55]

Simone Scalabrino, Mario Linares-Vásquez, Rocco Oliveto, and Denys Poshyvanyk. 2018. A comprehensive model for code readability.Journal of Software: Evolution and Process30, 6 (2018), e1958

2018
[56]

Agnia Sergeyuk, Olga Lvova, Sergey Titov, Anastasiia Serova, Farid Bagirov, and Timofey Bryksin. 2024. Assessing Consensus of Developers’ Views on Code Readability.arXiv preprint arXiv:2407.03790(2024)

arXiv 2024
[57]

Agnia Sergeyuk, Olga Lvova, Sergey Titov, Anastasiia Serova, Farid Bagirov, Evgeniia Kirillova, and Timofey Bryksin
[58]

InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension

Reassessing Java Code Readability Models with a Human-Centered Approach. InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension. 225–235
[59]

Sina Shamshiri, René Just, José Miguel Rojas, Gordon Fraser, Phil McMinn, and Andrea Arcuri. 2015. Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges (t). In2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 201–211

2015
[60]

Jiho Shin, Nima Shiri Harzevili, Reem Aleithan, Hadi Hemmati, and Song Wang. 2024. Retrieval-augmented test generation: How far are we?arXiv preprint arXiv:2409.12682(2024)

arXiv 2024
[61]

O’Reilly Media, Inc

James Shore and Shane Warden. 2021.The art of agile development. " O’Reilly Media, Inc. "

2021
[62]

Mohammed Latif Siddiq, Joanna CS Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Car- valho Lopes. 2024. Using Large Language Models to Generate JUnit Tests: An Empirical Study. (2024)

2024
[63]

O’Reilly Media, Inc

Saleem Siddiqui. 2021.Learning Test-Driven Development. " O’Reilly Media, Inc. "

2021
[64]

Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. 2024. Source Code Summarization in the Era of Large Language Models. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 419–431

2024
[65]

Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2024. Chatgpt vs sbst: A comparative assessment of unit test suite generation.IEEE Transactions on Software Engineering(2024)

2024
[66]

Arie Van Deursen, Leon Moonen, Alex Van Den Bergh, and Gerard Kok. 2001. Refactoring test code. InProceedings of the 2nd international conference on extreme programming and flexible processes in software engineering (XP2001). Citeseer, 92–95

2001
[67]

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering(2024)

2024
[68]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171 (2022)

Pith/arXiv arXiv 2022
[69]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al . 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

2022
[70]

Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. 2023. No more manual tests? evaluating and improving chatgpt for unit test generation.arXiv preprint arXiv:2305.04207(2023)

arXiv 2023
[71]

Zhichao Zhou, Yutian Tang, Yun Lin, and Jingzhu He. 2024. An LLM-based readability measurement for unit tests’ context-aware inputs.arXiv preprint arXiv:2407.21369(2024). A Additional Tables Table 10.Test generation statistics across datasets and tools. Dataset Model #Projects #ClassesTest GenerationRate (%) CompilabilityRate (%)TokensMax TokensMin Tokens...

arXiv 2024

[1] [1]

Toufique Ahmed and Premkumar Devanbu. 2022. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–5

2022

[2] [2]

Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl T Barr. 2024. Automatic semantic augmentation of language model prompts (for code summarization). In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE).IEEE Computer Society(2024), 1004–1004

2024

[3] [3]

M Moein Almasi, Hadi Hemmati, Gordon Fraser, Andrea Arcuri, and Janis Benefelds. 2017. An industrial evaluation of unit test generation: Finding real faults in a financial application. In2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). IEEE, 263–272

2017

[4] [4]

Andrea Arcuri and Gordon Fraser. 2013. Parameter tuning or default values? An empirical investigation in search-based software engineering.Empirical Software Engineering18 (2013), 594–623

2013

[5] [5]

Alberto Bacchelli, Paolo Ciancarini, and Davide Rossi. 2008. On the effectiveness of manual and automatic unit test generation. In2008 The Third International Conference on Software Engineering Advances. IEEE, 252–257

2008

[6] [6]

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine J. ACM, Vol. 1, No. 1, Article . Publication date: June 2026. Humanizing Automatically Generated Unit Test Suites with LLM-Based ...

2005

[7] [7]

Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and David Binkley. 2012. An empirical analysis of the distribution of unit test smells and their impact on software maintenance. In2012 28th IEEE international conference on software maintenance (ICSM). IEEE, 56–65

2012

[8] [8]

Gabriele Bavota, Abdallah Qusef, Rocco Oliveto, Andrea De Lucia, and Dave Binkley. 2015. Are test smells really harmful? an empirical study.Empirical Software Engineering20 (2015), 1052–1094

2015

[9] [9]

2000.Extreme programming explained: embrace change

Kent Beck. 2000.Extreme programming explained: embrace change. addison-wesley professional

2000

[10] [10]

Shreya Bhatia, Tarushi Gandhi, Dhruv Kumar, and Pankaj Jalote. 2024. Unit test generation using generative ai: A comparative performance analysis of autogeneration tools. InProceedings of the 1st International Workshop on Large Language Models for Code. 54–61

2024

[11] [11]

Matteo Biagiola, Gianluca Ghislotti, and Paolo Tonella. 2025. Improving the readability of automatically generated tests using large language models. In2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 162–173

2025

[12] [12]

Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability.IEEE Transactions on software engineering36, 4 (2009), 546–558

2009

[13] [13]

G Ann Campbell. 2018. Cognitive complexity: An overview and evaluation. InProceedings of the 2018 international conference on technical debt. 57–58

2018

[14] [14]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

Pith/arXiv arXiv 2021

[15] [15]

Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer. 2015. Modeling readability to improve unit tests. InProceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 107–118

2015

[16] [16]

Amirhossein Deljouyi, Roham Koohestani, Maliheh Izadi, and Andy Zaidman. 2024. Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests.arXiv preprint arXiv:2408.11710(2024)

arXiv 2024

[17] [17]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al . 2020. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155(2020)

Pith/arXiv arXiv 2020

[18] [18]

Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419

2011

[19] [19]

Gordon Fraser and Andrea Arcuri. 2014. A large-scale evaluation of automated unit test generation using evosuite. ACM Transactions on Software Engineering and Methodology (TOSEM)24, 2 (2014), 1–42

2014

[20] [20]

Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. 2015. Does automated unit test generation really help software testers? a controlled empirical study.ACM Transactions on Software Engineering and Methodology (TOSEM)24, 4 (2015), 1–49

2015

[21] [21]

Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia. 2024. Context-Enhanced LLM-Based Framework for Automatic Test Refactoring.arXiv preprint arXiv:2409.16739(2024)

arXiv 2024

[22] [22]

Yi Gao, Xing Hu, Xiaohu Yang, and Xin Xia. 2025. Automated Unit Test Refactoring.Proceedings of the ACM on Software Engineering2, FSE (2025), 713–733

2025

[23] [23]

Gregory Gay. 2023. Improving the Readability of Generated Tests Using GPT-4 and ChatGPT Code Interpreter. In International Symposium on Search Based Software Engineering. Springer, 140–146

2023

[24] [24]

Giovanni Grano, Simone Scalabrino, Harald C Gall, and Rocco Oliveto. 2018. An empirical investigation on the readability of manual and generated test cases. InProceedings of the 26th Conference on Program Comprehension. 348–351

2018

[25] [25]

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow.arXiv preprint arXiv:2009.08366 (2020)

Pith/arXiv arXiv 2020

[26] [26]

Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, et al. 2024. Found in the middle: Calibrating positional attention bias improves long context utilization. InFindings of the Association for Computational Linguistics: ACL 2024. 14982–14995

2024

[27] [27]

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In54th Annual Meeting of the Association for Computational Linguistics 2016. Association for Computational Linguistics, 2073–2083

2016

[28] [28]

Gunel Jahangirova and Valerio Terragni. 2023. SBFT tool competition 2023-Java test case generation track. In2023 IEEE/ACM International Workshop on Search-Based and Fuzz Testing (SBFT). IEEE, 61–64

2023

[29] [29]

René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. InProceedings of the 2014 international symposium on software testing and analysis. J. ACM, Vol. 1, No. 1, Article . Publication date: June 2026. 34 W. Ouédraogo et al. 437–440

2014

[30] [30]

Adam Kalai, Ofir Nachum, Santosh Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. https: //doi.org/10.48550/arXiv.2509.04664

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.04664 2025

[31] [31]

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931

2023

[32] [32]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

2004

[33] [33]

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. InProceedings of the 42nd annual meeting of the association for computational linguistics (ACL-04). 605–612

2004

[34] [34]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics 12 (2024), 157–173

2024

[35] [35]

Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for python. InProceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. 168–172

2022

[36] [36]

Antonio Mastropaolo, Matteo Ciniselli, Massimiliano Di Penta, and Gabriele Bavota. 2024. Evaluating Code Summariza- tion Techniques: A New Metric and an Empirical Characterization. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

2024

[37] [37]

Thomas J McCabe. 1976. A complexity measure.IEEE Transactions on software Engineering4 (1976), 308–320

1976

[38] [38]

Rian Melo, Pedro Simões, Rohit Gheyi, Marcelo d’Amorim, Márcio Ribeiro, Gustavo Soares, Eduardo Almeida, and Elvys Soares. 2025. Agentic LMs: Hunting Down Test Smells.IEEE Software(2025)

2025

[39] [39]

Delano Oliveira, Reyde Bruno, Fernanda Madeiral, Hidehiko Masuhara, and Fernando Castor. 2022. A systematic literature review on the impact of formatting elements on program understandability.A vailable at SSRN 4182156 (2022)

2022

[40] [40]

Wendkûuni C Ouédraogo, Kader Kaboré, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2024. Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation.arXiv preprint arXiv:2407.00225(2024)

arXiv 2024

[41] [41]

Wendkuuni C Ouedraogo, Kader Kabore, Haoye Tian, Yewei Song, Anil Koyuncu, Jacques Klein, David Lo, and Tegawende F Bissyande. 2024. Llms and prompting for unit test generation: A large-scale evaluation. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 2464–2465

2024

[42] [42]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Pawel Borsukiewicz, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Human-Aligned Code Readability Assessment with Large Language Models.arXiv preprint arXiv:2510.16579(2025)

arXiv 2025

[43] [43]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Xunzhu Tang, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2024. Test smells in llm-generated unit tests.arXiv preprint arXiv:2410.10628(2024)

arXiv 2024

[44] [44]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness.arXiv preprint arXiv:2506.06767(2025)

arXiv 2025

[45] [45]

Wendkûuni C Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Rethinking Cognitive Complexity for Unit Tests: Toward a Readability-Aware Metric Grounded in Developer Perception. In2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 797–802

2025

[46] [46]

Wendkûuni C Ouédraogo, Laura Plein, Kader Kabore, Andrew Habib, Jacques Klein, David Lo, and Tegawendé F Bissyandé. 2025. Enriching automatic test case generation by extracting relevant test inputs from bug reports.Empirical Software Engineering30, 3 (2025), 85

2025

[47] [47]

Fabio Palomba, Dario Di Nucci, Annibale Panichella, Rocco Oliveto, and Andrea De Lucia. 2016. On the diffusion of test smells in automatically generated test code: An empirical study. InProceedings of the 9th international workshop on search-based software testing. 5–14

2016

[48] [48]

Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. 2025. Aster: Natural and multi- language unit test generation with llms. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 413–424

2025

[49] [49]

Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2017. Automated test case generation as a many- objective optimisation problem with dynamic selection of the targets.IEEE Transactions on Software Engineering44, 2 (2017), 122–158

2017

[50] [50]

Annibale Panichella, Sebastiano Panichella, Gordon Fraser, Anand Ashok Sawant, and Vincent J Hellendoorn. 2020. Revisiting test smells in automatically generated tests: limitations, pitfalls, and opportunities. In2020 IEEE international J. ACM, Vol. 1, No. 1, Article . Publication date: June 2026. Humanizing Automatically Generated Unit Test Suites with L...

2020

[51] [51]

Anthony Peruma, Khalid Almalki, Christian D Newman, Mohamed Wiem Mkaouer, Ali Ouni, and Fabio Palomba

[52] [52]

InProceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering

Tsdetect: An open source test smells detection tool. InProceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. 1650–1654

[53] [53]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis.arXiv preprint arXiv:2009.10297 (2020)

Pith/arXiv arXiv 2020

[54] [54]

Devjeet Roy, Ziyi Zhang, Maggie Ma, Venera Arnaoudova, Annibale Panichella, Sebastiano Panichella, Danielle Gonzalez, and Mehdi Mirakhorli. 2020. DeepTC-Enhancer: Improving the readability of automatically generated tests. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. 287–298

2020

[55] [55]

Simone Scalabrino, Mario Linares-Vásquez, Rocco Oliveto, and Denys Poshyvanyk. 2018. A comprehensive model for code readability.Journal of Software: Evolution and Process30, 6 (2018), e1958

2018

[56] [56]

Agnia Sergeyuk, Olga Lvova, Sergey Titov, Anastasiia Serova, Farid Bagirov, and Timofey Bryksin. 2024. Assessing Consensus of Developers’ Views on Code Readability.arXiv preprint arXiv:2407.03790(2024)

arXiv 2024

[57] [57]

Agnia Sergeyuk, Olga Lvova, Sergey Titov, Anastasiia Serova, Farid Bagirov, Evgeniia Kirillova, and Timofey Bryksin

[58] [58]

InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension

Reassessing Java Code Readability Models with a Human-Centered Approach. InProceedings of the 32nd IEEE/ACM International Conference on Program Comprehension. 225–235

[59] [59]

Sina Shamshiri, René Just, José Miguel Rojas, Gordon Fraser, Phil McMinn, and Andrea Arcuri. 2015. Do automatically generated unit tests find real faults? an empirical study of effectiveness and challenges (t). In2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 201–211

2015

[60] [60]

Jiho Shin, Nima Shiri Harzevili, Reem Aleithan, Hadi Hemmati, and Song Wang. 2024. Retrieval-augmented test generation: How far are we?arXiv preprint arXiv:2409.12682(2024)

arXiv 2024

[61] [61]

O’Reilly Media, Inc

James Shore and Shane Warden. 2021.The art of agile development. " O’Reilly Media, Inc. "

2021

[62] [62]

Mohammed Latif Siddiq, Joanna CS Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Car- valho Lopes. 2024. Using Large Language Models to Generate JUnit Tests: An Empirical Study. (2024)

2024

[63] [63]

O’Reilly Media, Inc

Saleem Siddiqui. 2021.Learning Test-Driven Development. " O’Reilly Media, Inc. "

2021

[64] [64]

Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. 2024. Source Code Summarization in the Era of Large Language Models. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 419–431

2024

[65] [65]

Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2024. Chatgpt vs sbst: A comparative assessment of unit test suite generation.IEEE Transactions on Software Engineering(2024)

2024

[66] [66]

Arie Van Deursen, Leon Moonen, Alex Van Den Bergh, and Gerard Kok. 2001. Refactoring test code. InProceedings of the 2nd international conference on extreme programming and flexible processes in software engineering (XP2001). Citeseer, 92–95

2001

[67] [67]

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision.IEEE Transactions on Software Engineering(2024)

2024

[68] [68]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171 (2022)

Pith/arXiv arXiv 2022

[69] [69]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al . 2022. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems35 (2022), 24824–24837

2022

[70] [70]

Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. 2023. No more manual tests? evaluating and improving chatgpt for unit test generation.arXiv preprint arXiv:2305.04207(2023)

arXiv 2023

[71] [71]

Zhichao Zhou, Yutian Tang, Yun Lin, and Jingzhu He. 2024. An LLM-based readability measurement for unit tests’ context-aware inputs.arXiv preprint arXiv:2407.21369(2024). A Additional Tables Table 10.Test generation statistics across datasets and tools. Dataset Model #Projects #ClassesTest GenerationRate (%) CompilabilityRate (%)TokensMax TokensMin Tokens...

arXiv 2024