Generating Project-Specific Test Cases with Requirement Validation Intention

Binhang Qi; Chenyan Liu; Hailong Sun; Jin Song Dong; Xinyi Weng; Yuhuan Huang; Yun Lin; Zhi Jin

arxiv: 2507.20619 · v3 · submitted 2025-07-28 · 💻 cs.SE

Binhang Qi, Yun Lin, Xinyi Weng, Yuhuan Huang, Chenyan Liu, Hailong Sun, Zhi Jin, Jin Song Dong

Pith reviewed 2026-05-19 03:11 UTC · model grok-4.3

classification 💻 cs.SE

keywords test case generation · software testing · large language models · requirement validation · retrieval augmented generation · automated test generation · project-specific tests

The pith

Retrieving a similar project test and editing it with an LLM produces tests that better match developers' validation intentions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes IntentionTest to generate project-specific test cases from an explicit description of validation intention, which specifies the test scenario for a program function along with its preconditions and expected results. Rather than maximizing branch coverage or translating code directly, the method first retrieves an existing test from the same project as a reference and then uses an LLM to adapt that reference to the new intention. Evaluation across 3,680 test cases shows the resulting tests are more semantically aligned with ground-truth developer tests, as measured by killing more common mutants and sharing more common coverage, while also yielding a higher proportion of successful passing tests. A sympathetic reader would care because tests that directly reflect what a requirement needs to validate are more likely to be adopted and maintained than those produced by generic automation techniques.

Core claim

IntentionTest generates project-specific tests given a focal code and a validation intention description consisting of a test objective with precondition and expected results. It retrieves a reusable test in the project as reference and edits it with an LLM toward the target test. On 3,680 test cases, this produces tests far more semantically relevant to ground-truth tests by killing 28.1% to 37.6% more common mutants and sharing 16.9% to 23.9% more common coverage, while also generating 23.7% to 49.0% more successful passing tests than state-of-the-art baselines.

What carries the argument

IntentionTest's retrieval-and-edit pipeline, which locates a reusable test reference from the project and adapts it via LLM to a stated validation intention of objective, precondition, and expected results.
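
The retrieve-then-edit flow described above can be sketched roughly as follows. This is an illustrative outline only, not the paper's implementation: the `Intention` class, the token-overlap (Jaccard) ranking, and the prompt wording are all assumptions made for this example, and the actual LLM call is left out.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Intention:
    """Hypothetical container: objective plus precondition and expected results."""
    objective: str
    precondition: str
    expected: str

    def text(self) -> str:
        return f"{self.objective} {self.precondition} {self.expected}"


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; camelCase identifiers are split first."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_reference(intention: Intention, project_tests: dict[str, str]) -> str:
    """Return the name of the project test whose tokens best overlap the intention.

    Assumption: similarity is plain token overlap; the paper may rank differently.
    """
    query = tokens(intention.text())
    return max(
        project_tests,
        key=lambda name: jaccard(query, tokens(name + " " + project_tests[name])),
    )


def build_edit_prompt(focal_code: str, intention: Intention, reference_test: str) -> str:
    """Assemble the text an LLM would receive for the edit step (wording is made up)."""
    return (
        "Edit the reference test so that it validates the intention below.\n"
        f"Focal code:\n{focal_code}\n"
        f"Objective: {intention.objective}\n"
        f"Precondition: {intention.precondition}\n"
        f"Expected results: {intention.expected}\n"
        f"Reference test:\n{reference_test}\n"
    )
```

A caller would pick a reference with `retrieve_reference`, pass the result of `build_edit_prompt` to whichever LLM client it uses, and then compile and run the returned test.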

If this is right

Generated tests become more semantically relevant to actual developer-written tests.
Higher mutant-killing rates indicate stronger alignment with real fault-detection needs.
Greater coverage overlap with ground-truth tests reflects structural similarity to human-written tests.
A larger share of successful passing tests means the outputs are more immediately usable.
Tests reflecting explicit validation intentions are more likely to be adopted in practice.
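
One plausible way to read the "common mutants" and "common coverage" measures is as the share of what the ground-truth test kills or covers that the generated test also kills or covers. The sketch below follows that reading; the summary does not give the paper's exact formulas, so the function name, the normalisation by the ground-truth set, and the sample identifiers are assumptions.

```python
def common_ratio(generated: set[str], ground_truth: set[str]) -> float:
    """Share of ground-truth items that the generated test also hits.

    Items can be mutant IDs (killed mutants) or "file:line" strings (covered lines).
    Assumption: the ratio is normalised by the ground-truth set.
    """
    if not ground_truth:
        return 0.0
    return len(generated & ground_truth) / len(ground_truth)


# Hypothetical data for one focal method.
truth_killed = {"m1", "m2", "m3", "m4"}
generated_killed = {"m2", "m3", "m9"}
truth_covered = {"Foo.java:10", "Foo.java:11", "Foo.java:12"}
generated_covered = {"Foo.java:10", "Foo.java:12", "Bar.java:3"}

common_mutants = common_ratio(generated_killed, truth_killed)
common_coverage = common_ratio(generated_covered, truth_covered)
```

Under this reading, a generated test that kills many mutants the developer's test never targeted gains nothing, which is the sense in which the measure tracks semantic relevance rather than raw fault detection.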

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-and-edit pattern could be applied to other artifacts such as bug reports or requirements documents to keep them consistent with code changes.
Projects with very few or highly unique tests may need fallback strategies when retrieval finds no close reference.
Linking the validation intention description directly to formal requirements could create traceable tests from specification to execution.
Extending the intention to cover non-functional aspects such as performance constraints would broaden the method beyond functional validation.

Load-bearing premise

The method assumes a suitable test reference can be retrieved from the project that is close enough in structure and intent for the LLM edit to succeed without semantic drift or incorrect assertions.

What would settle it

Running IntentionTest on a project containing no tests structurally similar to the target validation scenarios and observing that mutant-killing rates and passing-test counts fall to baseline levels or below would falsify the central claim.

Figures

Figures reproduced from arXiv: 2507.20619 by Binhang Qi, Chenyan Liu, Hailong Sun, Jin Song Dong, Xinyi Weng, Yuhuan Huang, Yun Lin, Zhi Jin.

**Figure 2.** Overview of IntentionTest: Given a focal method and the validation intention description of its test, …

**Figure 3.** Taking the description of …

**Figure 4.** An example of extending usage scenarios. By discovering the method delegation relation, we can …

**Figure 5.** Crucial fact ignite() provides two hints for editing the test case shown in Listing 4.

Co-occurrence Identification. We first identify all the usage scenarios of the focal methods $m_{tar}$, where each scenario is a method invoking $m_{tar}$. Given a usage scenario $s_i$ of $m_{tar}$, we call the set of all its invoking program elements as a co-occurring set to $m_{tar}$, denoted as $co\_occur(s_i, m_{tar}) = \{e_1, e_2, \ldots, e_k\}$ where …

**Figure 6.** Comparison of tests generated by IntentionTest and ChatTester (CMS: 100% vs 8%).

**Figure 7.** A showcase of the effectiveness of test objectives.
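
The co-occurrence identification quoted alongside Figure 5 can be illustrated with a small sketch. It assumes a precomputed call map from each method to the program elements it invokes, and it reads the co-occurring set of a usage scenario as the elements that scenario invokes besides the focal method itself. Both the data layout and the exclusion of the focal method are assumptions, since the quoted definition is cut off.

```python
from __future__ import annotations


def co_occurring_sets(call_map: dict[str, set[str]], m_tar: str) -> dict[str, set[str]]:
    """Map each usage scenario of m_tar to the elements it invokes alongside m_tar.

    A usage scenario is any method whose invoked elements include m_tar.
    Assumption: m_tar itself is left out of its own co-occurring set.
    """
    return {
        scenario: invoked - {m_tar}
        for scenario, invoked in call_map.items()
        if m_tar in invoked
    }


# Hypothetical call map; all method names are invented for illustration.
call_map = {
    "startServer": {"Service.handle", "Service.init"},
    "shutdown": {"Service.stop"},
    "testHandle": {"Service.handle", "Request.of", "Service.init"},
}
scenarios = co_occurring_sets(call_map, "Service.handle")
```

Here `shutdown` is not a usage scenario because it never invokes the focal method, so it contributes no co-occurring set.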

read the original abstract

Test cases are valuable assets for maintaining software quality. State-of-the-art automated test generation techniques typically focus on maximizing program branch coverage or translating focal methods into test code. However, in contrast to branch coverage or code-to-test translation, practical tests are written out of the need to validate whether a requirement has been fulfilled. Specifically, each test usually reflects a developer's validation intention for a program function, regarding (1) what is the test scenario of a program function? and (2) what is expected behavior under such a scenario? Without taking such intention into account, generated tests are less likely to be adopted in practice. In this work, we propose IntentionTest, which generates project-specific tests given the description of validation intention. IntentionTest adopts a retrieval-and-edit manner. First, given a focal code and a description of validation intention consisting of a test objective with test precondition and expected results, IntentionTest retrieves a reusable test in the project as the test reference. Then, IntentionTest edits the test reference with an LLM regarding the validation intention toward the target test. We extensively evaluate IntentionTest against four baselines on 3,680 test cases. Compared to state-of-the-art baselines, IntentionTest can (1) generate tests far more semantically relevant to ground-truth tests by (i) killing 28.1% to 37.6% more common mutants and (ii) sharing 16.9% to 23.9% more common coverage; and (2) generate 23.7% to 49.0% more successful passing tests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IntentionTest shows gains from feeding a structured validation intention into retrieval-plus-LLM-edit, but the reported deltas may simply reflect that extra input rather than the technique itself.

The core move here is to take an explicit validation intention (objective, precondition, expected results) and use it to pull a project-internal test as a reference, then have an LLM rewrite it for the target focal method. That retrieval-and-edit loop with intention as the driver is the actual novelty; prior coverage or translation baselines do not normally operate this way.

On the positive side, the evaluation uses 3,680 tests and reports consistent lifts: 28–38% more common mutants killed, 17–24% more shared coverage, and 24–49% more passing tests against four baselines. Those numbers are concrete and the ground-truth comparison to developer-written tests is a reasonable anchor. The work also stays grounded in a practical pain point (tests that developers might actually keep) rather than chasing pure coverage numbers.

The soft spot is exactly the one the stress-test flags. The abstract presents the intention description as input only to IntentionTest. Standard baselines (coverage-driven or focal-method translators) do not receive an equivalent structured oracle. If that asymmetry holds in the full experiments, the headline improvements cannot be cleanly attributed to the retrieval-edit pipeline; they could just be the result of giving the LLM a clearer target. Without an ablation that supplies comparable intention-style guidance to the baselines, the causal claim is under-supported.

Minor issues include the usual questions about statistical testing and whether mutant and coverage metrics were computed identically across all systems, but those are fixable.

This is a paper for people working on LLM-assisted test generation who care about adoption. It is coherent on its own terms and shows honest engagement with the literature on intention versus coverage. A serious editor should send it out for review, but the referees will need to press hard on the input-control question before the results can be taken at face value.

Referee Report

2 major / 2 minor

Summary. The paper introduces IntentionTest, a retrieval-and-edit technique that generates project-specific test cases given a validation intention (test objective, precondition, and expected results). It retrieves a reusable test from the project as reference and uses an LLM to edit it toward the target scenario. Evaluation on 3,680 test cases across projects shows IntentionTest outperforms four baselines by killing 28.1–37.6% more common mutants, sharing 16.9–23.9% more common coverage, and producing 23.7–49.0% more successful passing tests.

Significance. If the central empirical claims hold under fair conditions, the work meaningfully advances automated test generation by shifting focus from coverage maximization or focal-method translation to explicit requirement-validation intentions, potentially increasing the practical adoptability of generated tests. The scale of the evaluation (3,680 cases) and use of mutant analysis plus coverage overlap as semantic-relevance proxies are concrete strengths that support falsifiable claims.

major comments (2)

[§5.2] §5.2 and Table 2: the experimental comparison does not indicate that the four baselines receive the same structured validation intention (objective + precondition + expected results) that IntentionTest uses for retrieval and editing. Standard coverage-driven or focal-method baselines normally operate without this oracle-like input; if the intention is supplied only to IntentionTest, the reported deltas (e.g., 28.1–37.6% more mutants killed) cannot isolate the contribution of the retrieval-and-edit pipeline from the effect of extra input. This assumption is load-bearing for the headline superiority claim.
[§4.3] §4.3 and §5.3: no ablation or error analysis is presented on retrieval quality or on cases where LLM editing introduces semantic drift or incorrect assertions. The weakest assumption—that a retrieved reference is sufficiently close in structure and intent for reliable transformation—is therefore untested, undermining confidence that the observed gains stem from the proposed method rather than fortunate retrievals.

minor comments (2)

[§5.1] The description of the four baselines in §5.1 would benefit from explicit pseudocode or parameter settings to allow exact reproduction.
[Figure 2] Figure 2 (pipeline overview) uses small font sizes for the intention components; enlarging or adding a legend would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the referee's constructive comments and positive assessment of the work's significance, evaluation scale, and use of mutant analysis and coverage overlap. We address each major comment point by point below.

Referee: [§5.2] §5.2 and Table 2: the experimental comparison does not indicate that the four baselines receive the same structured validation intention (objective + precondition + expected results) that IntentionTest uses for retrieval and editing. Standard coverage-driven or focal-method baselines normally operate without this oracle-like input; if the intention is supplied only to IntentionTest, the reported deltas (e.g., 28.1–37.6% more mutants killed) cannot isolate the contribution of the retrieval-and-edit pipeline from the effect of extra input. This assumption is load-bearing for the headline superiority claim.

Authors: The structured validation intention is the defining input to IntentionTest and the central element of our contribution, as the abstract and §1 emphasize that practical tests are written to validate specific requirements rather than maximize coverage or translate focal methods. The four baselines are established techniques that do not accept or exploit such explicit intention descriptions; they are intentionally chosen to represent current state-of-the-art approaches that lack this capability. The reported improvements therefore demonstrate the benefit of incorporating intention via retrieval-and-edit, which is precisely the point of the work. We will revise §5.2 and the caption of Table 2 to clarify that the intention is not an extraneous oracle but the core input that enables the proposed pipeline, and we will add a short discussion of why comparing against intention-agnostic baselines is the appropriate way to quantify the practical advantage.
revision: partial
Referee: [§4.3] §4.3 and §5.3: no ablation or error analysis is presented on retrieval quality or on cases where LLM editing introduces semantic drift or incorrect assertions. The weakest assumption—that a retrieved reference is sufficiently close in structure and intent for reliable transformation—is therefore untested, undermining confidence that the observed gains stem from the proposed method rather than fortunate retrievals.

Authors: We agree that a dedicated analysis of retrieval quality and LLM editing behavior would increase confidence in the results. The current manuscript reports only aggregate end-to-end metrics across 3,680 cases. In the revised version we will add an error analysis subsection (or appendix) that (i) reports retrieval similarity statistics (e.g., token overlap or embedding distance between the retrieved reference and the target test) and (ii) presents a manual inspection of a random sample of cases, categorizing instances of semantic drift or incorrect assertions introduced during editing. This will directly test the assumption that retrieved references are sufficiently close for reliable transformation.
revision: yes
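
As a rough illustration of the retrieval similarity statistics mentioned in the response, token overlap between each retrieved reference and its target test could be summarised as below. This is a sketch under stated assumptions, not the authors' analysis: the tokenisation, the normalisation by target length, and the chosen summary statistics are all invented for this example.

```python
import re
import statistics
from collections import Counter


def token_overlap(reference: str, target: str) -> float:
    """Multiset token overlap, normalised by the number of tokens in the target test."""
    ref = Counter(re.findall(r"\w+", reference))
    tgt = Counter(re.findall(r"\w+", target))
    total = sum(tgt.values())
    # Counter & Counter keeps the minimum count of each shared token.
    return sum((ref & tgt).values()) / total if total else 0.0


def summarize(pairs: list) -> dict:
    """Summary statistics over (reference, target) pairs; expects a non-empty list."""
    scores = [token_overlap(reference, target) for reference, target in pairs]
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
    }
```

Reporting the minimum next to the mean would show whether the aggregate gains depend on a few very close references, which is the concern the referee raises.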

Circularity Check

0 steps flagged

No circularity: empirical results rest on external benchmarks

The paper proposes IntentionTest as a retrieval-and-edit technique and evaluates it empirically on 3,680 test cases against four external baselines, measuring mutant killing rates, coverage overlap, and passing-test counts relative to ground-truth developer tests. No equations, fitted parameters, or first-principles derivations are presented whose outputs reduce by construction to the inputs. The reported deltas are computed from independent test suites and standard mutation/coverage tools, satisfying the self-contained-against-external-benchmarks criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the assumption that project tests contain reusable structural patterns that an LLM can meaningfully adapt; no explicit free parameters, new axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5832 in / 1199 out tokens · 35745 ms · 2026-05-19T03:11:19.794551+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generalizing Test Cases for Comprehensive Test Scenario Coverage
cs.SE · 2026-04 · unverdicted · novelty 6.0

TestGeneralizer generalizes an initial test into a set of executable tests covering more diverse scenarios, delivering +31.66% mutation-based and +23.08% LLM-assessed scenario coverage gains over ChatTester on 12 open...

ARuleCon: Agentic Security Rule Conversion
cs.CR · 2026-04 · unverdicted · novelty 6.0

ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows
cs.SE · 2026-02 · unverdicted · novelty 6.0

EditFlow reconstructs temporal developer editing flows from code changes to benchmark and optimize AI code edit recommenders so they align with natural incremental reasoning rather than static snapshots.

Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
cs.SE · 2026-04 · unverdicted · novelty 5.0

TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.


work page 2014

[15] [15]

cron-utils authors. 2025. Cron utils for parsing, validations and human readable descriptions as well as date/time interoperability. https://github.com/jmrozanec/cron-utils

work page 2025

[16] [16]

Ermira Daka, José Campos, Gordon Fraser, Jonathan Dorn, and Westley Weimer. 2015. Modeling readability to improve unit tests. InProceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. 107–118

work page 2015

[17] [17]

Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, and Shuvendu K Lahiri. 2022. Toga: A neural method for test oracle generation. InProceedings of the 44th International Conference on Software Engineering. 2130–2141

work page 2022

[18] [18]

Chunhao Dong, Yanjie Jiang, Yuxia Zhang, Yang Zhang, and Liu Hui. 2025. ChatGPT-Based Test Generation for Refactoring Engines Enhanced by Feature Analysis on Examples . In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 746–746. doi:10.1109/ICSE55347.2025.00210

work page doi:10.1109/icse55347.2025.00210 2025

[19] [19]

Emad Fallahzadeh, Amir Hossein Bavand, and Peter C Rigby. 2023. Accelerating Continuous Integration with Parallel Batch Testing. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 55–67

work page 2023

[20] [20]

Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419

work page 2011

[21] [21]

Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, and Michael Lyu. 2025. The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation. arXiv:2501.01329

work page arXiv 2025

[22] [22]

Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed automated random testing. InProceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation. 213–223

work page 2005

[23] [23]

Javier Godoy, Juan Pablo Galeotti, Diego Garbervetsky, and Sebastián Uchitel. 2021. Enabledness-based testing of object protocols.ACM Transactions on Software Engineering and Methodology (TOSEM)30, 2 (2021), 1–36

work page 2021

[24] [24]

Larisa Gota, Dan Gota, and Liviu Miclea. 2020. Continuous Integration in Automation Testing. In2020 IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR). IEEE, 1–6. 19

work page 2020

[25] [25]

Gall, and Rocco Oliveto

Giovanni Grano, Simone Scalabrino, Harald C. Gall, and Rocco Oliveto. 2018. An empirical investigation on the readability of manual and generated test cases. InProceedings of the 26th Conference on Program Comprehension. 348–351

work page 2018

[26] [26]

imglib authors. 2023. Imglib: lightweight Image processing library. https://github.com/nackily/imglib

work page 2023

[27] [27]

Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Large language models are few-shot testers: Exploring llm-based general bug reproduction. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2312–2323

work page 2023

[28] [28]

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. 2023. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 919–931

work page 2023

[29] [29]

Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 14–26

work page 2023

[30] [30]

Yun Lin, You Sheng Ong, Jun Sun, Gordon Fraser, and Jin Song Dong. 2021. Graph-based seed object synthesis for search-based unit testing. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1068–1080

work page 2021

[31] [31]

Yun Lin, Jun Sun, Gordon Fraser, Ziheng Xiu, Ting Liu, and Jin Song Dong. 2020. Recovering fitness gradients for interprocedural Boolean flags in search-based testing. InProceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 440–451

work page 2020

[32] [32]

Simone Mezzaro, Alessio Gambi, and Gordon Fraser. 2024. An empirical study on how large language models impact software testing learning. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 555–564

work page 2024

[33] [33]

Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, Chenxue Wang, Shichao Liu, and Qing Wang. 2023. ClarifyGPT: Empowering LLM-based Code Generation with Intention Clarification.arXiv preprint arXiv:2310.10996 (2023)

work page arXiv 2023

[34] [34]

Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, ChenXue Wang, Shichao Liu, and Qing Wang. 2024. ClarifyGPT: A Framework for Enhancing LLM-Based Code Generation via Requirements Clarification.Proceedings of the ACM on Software Engineering1, FSE (2024), 2332–2354

work page 2024

[35] [35]

Zifan Nan, Zhaoqiang Guo, Kui Liu, and Xin Xia. 2025. Test Intention Guided LLM-Based Unit Test Generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). 1026–1038

work page 2025

[36] [36]

Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond J Mooney, and Milos Gligoric. 2023. Learning deep semantics for test completion. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2111–2123

work page 2023

[37] [37]

(2025, January)

Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. 2024. An Empirical Study of the Non-determinism of ChatGPT in Code Generation.ACM Trans. Softw. Eng. Methodol.(2024). doi:10.1145/3697010

work page doi:10.1145/3697010 2024

[38] [38]

Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. InCompanion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. 815–816

work page 2007

[39] [39]

Fabio Palomba, Dario Di Nucci, Annibale Panichella, Rocco Oliveto, and Andrea De Lucia. 2016. On the diffusion of test smells in automatically generated test code: an empirical study. InProceedings of the 9th International Workshop on Search-Based Software Testing. 5–14

work page 2016

[40] [40]

Fabio Palomba, Annibale Panichella, Andy Zaidman, Rocco Oliveto, and Andrea De Lucia. 2016. Automatic test case generation: what if test code quality matters?. InProceedings of the 25th International Symposium on Software Testing and Analysis. 130–141

work page 2016

[41] [41]

Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. 2023. In-context unlearning: Language models as few shot unlearners.arXiv preprint arXiv:2310.07579(2023)

work page arXiv 2023

[42] [42]

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis.arXiv preprint arXiv:2009.10297 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2020

[43] [43]

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al . 2023. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[44] [44]

Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation.IEEE Transactions on Software Engineering(2023)

work page 2023

[45] [45]

Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A concolic unit testing engine for C.ACM SIGSOFT Software Engineering Notes30, 5 (2005), 263–272

work page 2005

[46] [46]

Jiho Shin, Sepehr Hashtroudi, Hadi Hemmati, and Song Wang. 2024. Domain Adaptation for Code Model-Based Unit Test Case Generation. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1211–1222. 20

work page 2024

[47] [47]

Shota Takashiro, Takeshi Kojima, Andrew Gambardella, Qi Cao, Yusuke Iwasawa, and Yutaka Matsuo. 2024. Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning.arXiv preprint arXiv:2410.00382(2024)

work page arXiv 2024

[48] [48]

truth authors. 2025. Fluent assertions for Java and Android. https://github.com/google/truth

work page 2025

[49] [49]

Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit test case generation with transformers and focal context.arXiv preprint arXiv:2009.05617(2020)

work page arXiv 2020

[50] [50]

Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi. 2023. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 1069–1088

work page 2023

[51] [51]

Jin Wen, Qiang Hu, Yuejun Guo, Maxime Cordy, and Yves Le Traon. 2025. Variable Renaming-Based Adversarial Test Generation for Code Model: Benchmark and Enhancement.ACM Transactions on Software Engineering and Methodology (2025)

work page 2025

[52] [52]

Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4all: Univer- sal fuzzing with large language models. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

work page 2024

[53] [53]

yavi authors. 2025. a lambda based type safe validation for Java. https://github.com/making/yavi

work page 2025

[54] [54]

Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. 2024. Evaluating and Improving ChatGPT for Unit Test Generation.Proc. ACM Softw. Eng.1, FSE, Article 76 (jul 2024), 24 pages. doi:10.1145/3660783

work page doi:10.1145/3660783 2024

[55] [55]

Xin Zhou, Kisub Kim, Bowen Xu, DongGyun Han, Junda He, and David Lo. 2023. Generation-based code review automation: how far are weƒ. In2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC). IEEE, 215–226. 21

work page 2023