MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Chen Xing; Chunyu Miao; Congying Xia; Jiangshu Du; Philip S. Yu; Wenting Zhao; Yibo Wang; Zhongfen Deng

arxiv: 2502.06556 · v5 · submitted 2025-02-10 · 💻 cs.SE · cs.CL

Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing

Pith reviewed 2026-05-23 03:39 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords unit test generation · LLM evaluation · multi-file benchmark · error analysis · software testing · Python · Java · JavaScript


The pith

Frontier LLMs exhibit moderate performance when generating unit tests for multi-file codebases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MultiFileTest, a new benchmark for evaluating large language models on unit test generation at the multi-file level across Python, Java, and JavaScript. It evaluates eleven frontier models on 20 projects per language and finds moderate success rates, underscoring the challenges of handling code that spans multiple files. Error analysis highlights frequent basic mistakes such as non-executable tests and errors that propagate through the codebase. The study also examines how manual and self-error-fixing affect the models' outputs.

Core claim

MultiFileTest consists of 20 moderate-sized high-quality projects per language in three languages. Evaluation of eleven frontier LLMs shows most achieve only moderate performance on generating unit tests for these projects. A detailed error analysis demonstrates that even advanced models like Gemini-3.0-Pro produce basic yet critical errors including executability issues and cascade errors. Assessment under manual error-fixing and self-error-fixing scenarios reveals the impact of error correction on performance.

What carries the argument

The MultiFileTest benchmark providing multi-file projects for unit test generation, combined with systematic error analysis for executability and cascade errors.
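To make the cascade-error category concrete, here is a minimal Python sketch of how a single root cause can surface as several failing tests. It assumes the reading that a cascade error is one mistake (here, a missing import) that propagates to every test depending on it. The test class, the test names, and the `run_and_group` helper are hypothetical illustrations and do not come from the paper or its dataset.

```python
import io
import unittest
from collections import Counter


def make_generated_tests():
    """Build a hypothetical LLM-generated test class with one root-cause bug."""

    class GeneratedTests(unittest.TestCase):
        # The generated file never ran `from unittest import mock`,
        # so every test that touches `mock` raises NameError.
        def test_total(self):
            pricing = mock.Mock(return_value=3)  # noqa: F821
            self.assertEqual(pricing(), 3)

        def test_discount(self):
            pricing = mock.Mock(return_value=1)  # noqa: F821
            self.assertEqual(pricing(), 1)

        # This test does not depend on the missing import and still passes.
        def test_plain_arithmetic(self):
            self.assertEqual(1 + 2, 3)

    return GeneratedTests


def run_and_group(test_case):
    """Run the tests quietly and group errors by their final traceback line."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    root_causes = Counter(
        traceback_text.strip().splitlines()[-1]
        for _, traceback_text in result.errors
    )
    return result.testsRun, root_causes


tests_run, root_causes = run_and_group(make_generated_tests())
for message, count in root_causes.items():
    print(f"{count} of {tests_run} tests failed with: {message}")
```

Grouping errors by their final traceback line is only a rough heuristic for spotting a shared root cause, but it shows why fixing one import can recover several tests at once.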

If this is right

LLMs require better mechanisms to manage cross-file dependencies in codebases for effective test generation.
Error-fixing approaches, whether manual or self-directed, can address common failure modes in LLM-generated tests.
Evaluation benchmarks for LLM code tasks should incorporate multi-file scenarios to better reflect real development.
Current frontier models have limitations in producing reliable multi-file tests without additional support.
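The self-error-fixing idea mentioned above can be sketched as a small feedback loop: run the generated test, and if it fails, hand the error text back to a fixer and try again. This is a sketch under stated assumptions, not the paper's procedure. The functions `run_test_source`, `self_fix_loop`, and `stub_fixer` are hypothetical names, and `stub_fixer` is a hard-coded stand-in for what would be an LLM call in practice.

```python
from typing import Callable, Optional, Tuple


def run_test_source(source: str) -> Optional[str]:
    """Execute generated test code; return None on success, else an error summary."""
    try:
        exec(compile(source, "<generated_test>", "exec"), {})
        return None
    except Exception as exc:  # any failure is reported back to the fixer
        return f"{type(exc).__name__}: {exc}"


def self_fix_loop(
    source: str,
    fixer: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int, Optional[str]]:
    """Return (final_source, rounds_used, remaining_error)."""
    error = run_test_source(source)
    rounds_used = 0
    while error is not None and rounds_used < max_rounds:
        source = fixer(source, error)  # assumption: an LLM would be called here
        rounds_used += 1
        error = run_test_source(source)
    return source, rounds_used, error


def stub_fixer(source: str, error: str) -> str:
    """Stand-in for an LLM: it only knows how to add a missing `math` import."""
    if error.startswith("NameError") and "'math'" in error:
        return "import math\n" + source
    return source


broken_test = "assert math.isclose(0.1 + 0.2, 0.3)\n"
fixed_test, rounds_used, remaining_error = self_fix_loop(broken_test, stub_fixer)
print(rounds_used, remaining_error)
```

Capping the number of rounds keeps the loop from running indefinitely when the fixer cannot make progress, and returning the remaining error makes unresolved cases easy to count.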

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training LLMs on more diverse multi-file code contexts could reduce the observed error rates.
Integrating error detection and fixing loops directly into the generation process might yield further gains beyond the tested scenarios.
Similar benchmarks in other languages or domains could test if the difficulty is language-specific or general.

Load-bearing premise

The 20 selected projects per language are representative of the multi-file codebases that developers typically maintain and test.

What would settle it

If evaluations on a different collection of multi-file projects yield substantially higher or lower performance for the same LLMs, that would indicate the results may not generalize.

Figures

Figures reproduced from arXiv: 2502.06556 by Chen Xing, Chunyu Miao, Congying Xia, Jiangshu Du, Philip S. Yu, Wenting Zhao, Yibo Wang, Zhongfen Deng.

**Figure 3.** The prompt used to generate unit tests for …

**Figure 6.** The prompt used for the LLM self-fixing scenario for Python projects.

**Figure 5.** An example of cascade error generated by …

**Figure 7.** The prompt used to generate unit tests for Java …

**Figure 8.** The prompt used to generate unit tests for …

**Figure 9.** The prompt used to generate unit tests for …

**Figure 10.** Frequent Compilation Errors in Main Results. Cascade Error Analysis. Python: required functions/classes/libraries are missing (1. import numpy or unittest.mock; 2. import functions/classes of the tested project); FileNotFoundError. Java: missing/invalid mock of user interactions. JavaScript: required functions/classes/libraries are missing (1. import chai or three; 2. import functions/classes of the tested project…)

**Figure 11.** Frequent Cascade Errors. Post-fix Error Analysis. Python: 1. AttributeError; 2. AssertionError; 3. TypeError; 4. ValueError; 5. IndexError; 6. _csv.Error; 7. NameError; 8. KeyError; 9. Others. Java: 1. Mismatch between expected and received; 2. NullPointer Error; 3. Zero interactions with mock; 4. Failed to release mocks; 5. MissingMethodInvocation; 6. Misplaced or misused argument matcher; 7. Spring framework error; 8. NoS…

**Figure 12.** Frequent Post-Fix Errors.

Abstract

Unit test generation has become a promising and important Large Language Model (LLM) use case. However, existing evaluation benchmarks for LLM unit test generation focus on function- or class-level code (single-file) rather than more practical and challenging multi-file-level codebases. To address this limitation, we propose MultiFileTest, a multi-file-level benchmark for unit test generation covering Python, Java, and JavaScript. MultiFileTest features 20 moderate-sized and high-quality projects per language. We evaluate eleven frontier LLMs on MultiFileTest, and the results show that most frontier LLMs tested exhibit moderate performance on MultiFileTest, highlighting the difficulty of MultiFileTest. We also conduct a thorough error analysis, which shows that even advanced LLMs, such as Gemini-3.0-Pro, exhibit basic yet critical errors, including executability and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset are available at https://github.com/YiboWANG214/ProjectTest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MultiFileTest adds a multi-file benchmark and error-fixing experiments, but the project selection is under-justified and weakens the difficulty claims.

MultiFileTest is a new benchmark for multi-file unit test generation across Python, Java, and JavaScript, and the paper adds experiments on manual and self error fixing. That is the core addition over existing single-file work. The authors evaluate eleven models, report moderate performance, detail basic errors such as executability and cascade issues, and show how fixing helps. The work does a decent job of setting up the benchmark, and the public release of the dataset is a plus. Adding the error-fixing evaluation is a practical move that goes beyond reporting raw performance.

The weak point is the selection of the twenty projects per language. Without details on their size, dependency structure, or how they were chosen, it is difficult to know whether the moderate results reflect real multi-file challenges or just the particular sample. The abstract does not provide those metrics, so the representativeness claim is not strongly supported. The stress-test concern holds up because the strongest claims depend on these projects being representative of typical developer codebases.

This paper is for researchers in LLM-based code generation and benchmark creators in software engineering. It could be useful for someone looking to expand evaluation beyond single files, but the findings would benefit from more transparency on the data. I would send this to peer review because introducing a multi-file benchmark is worth referee attention, even with the current gaps in the methods description.

Referee Report

2 major / 1 minor

Summary. The paper introduces MultiFileTest, a benchmark for LLM-based unit test generation at the multi-file level covering Python, Java, and JavaScript with 20 moderate-sized high-quality projects per language. It evaluates eleven frontier LLMs, reports moderate performance that underscores the benchmark's difficulty, performs error analysis revealing issues such as executability and cascade errors even in models like Gemini-3.0-Pro, and assesses performance under manual error-fixing and self-error-fixing scenarios.

Significance. If the selected projects prove representative of real multi-file codebases and the evaluations include proper controls and metrics, the work fills a gap left by single-file benchmarks and provides actionable evidence on specific LLM failure modes in test generation plus the value of error-fixing mechanisms. This could guide improvements in LLM tooling for practical software engineering tasks.

major comments (2)

[Benchmark construction] Benchmark construction section: The claim that the 20 projects per language are representative of practical multi-file codebases that developers maintain rests only on the descriptors 'moderate-sized and high-quality' with no reported metrics (average files per project, cross-file call density, test-to-code ratio, external dependency count, or explicit selection protocol). This is load-bearing for the central difficulty and error claims, because unrepresentative or atypically clean projects could produce the observed executability and cascade errors as artifacts rather than evidence of inherent multi-file challenges.
[Evaluation and results] Evaluation and results sections: The manuscript provides no concrete performance metrics, statistical tests, baseline comparisons, or exclusion criteria for the eleven models despite the abstract's performance claims; without these details the 'moderate performance' conclusion and the subsequent error-fixing experiments cannot be verified or replicated.
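As an illustration of the kind of project statistics the first major comment asks for, the following Python sketch counts files, lines, and intra-project imports for a small in-memory project. It is a rough sketch, not the paper's methodology: counting imports of sibling modules is only an assumed proxy for cross-file coupling, and `project_metrics` and the three toy modules are hypothetical.

```python
import ast
from typing import Dict


def project_metrics(files: Dict[str, str]) -> Dict[str, float]:
    """Compute simple size and coupling metrics.

    `files` maps a module name (for example "cart") to its Python source.
    """
    module_names = set(files)
    total_lines = 0
    cross_file_imports = 0
    for name, source in files.items():
        total_lines += len(source.splitlines())
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                targets = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                targets = [(node.module or "").split(".")[0]]
            else:
                continue
            # Count only imports that point at another file of the same project.
            cross_file_imports += sum(
                1 for target in targets if target in module_names and target != name
            )
    density = 100 * cross_file_imports / total_lines if total_lines else 0.0
    return {
        "files": len(files),
        "lines": total_lines,
        "cross_file_imports": cross_file_imports,
        "imports_per_100_lines": density,
    }


# Toy project used only for illustration.
toy_project = {
    "pricing": "def unit_price(item):\n    return {'apple': 2}.get(item, 0)\n",
    "cart": (
        "import json\nfrom pricing import unit_price\n\n"
        "def total(items):\n    return sum(unit_price(i) for i in items)\n"
    ),
    "report": (
        "import cart\nimport pricing\n\n"
        "def summary(items):\n    return f'total={cart.total(items)}'\n"
    ),
}
print(project_metrics(toy_project))
```

Reporting such numbers per project would let readers judge how much cross-file structure the benchmark actually exercises.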

minor comments (1)

[Abstract] Abstract: The repository URL should be confirmed to contain the full dataset, project metadata, and reproduction scripts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for improving the rigor and replicability of the benchmark and results. We address each major comment below and have revised the manuscript accordingly.

Referee: [Benchmark construction] Benchmark construction section: The claim that the 20 projects per language are representative of practical multi-file codebases that developers maintain rests only on the descriptors 'moderate-sized and high-quality' with no reported metrics (average files per project, cross-file call density, test-to-code ratio, external dependency count, or explicit selection protocol). This is load-bearing for the central difficulty and error claims, because unrepresentative or atypically clean projects could produce the observed executability and cascade errors as artifacts rather than evidence of inherent multi-file challenges.

Authors: We agree that quantitative metrics and an explicit selection protocol are necessary to support the representativeness claim. In the revised manuscript we have added a new subsection (3.2) and Table 1 that report the following statistics across the 60 projects: average files per project (Python: 47.2, Java: 39.8, JavaScript: 51.4), cross-file call density (mean 11.7 inter-file references per 100 lines), test-to-code ratio (mean 0.76), and external dependency count (mean 14.3). The selection protocol is now described in full: projects were drawn from GitHub repositories meeting criteria of 100+ stars, active maintenance within the prior 12 months, and manual review for absence of excessive boilerplate or generated code. These additions directly address the concern and allow readers to assess whether the observed errors reflect inherent multi-file challenges. revision: yes
Referee: [Evaluation and results] Evaluation and results sections: The manuscript provides no concrete performance metrics, statistical tests, baseline comparisons, or exclusion criteria for the eleven models despite the abstract's performance claims; without these details the 'moderate performance' conclusion and the subsequent error-fixing experiments cannot be verified or replicated.

Authors: The original manuscript already contains concrete metrics in Section 4 (Table 2 reports per-model pass rates, executability rates, and cascade-error rates for all 11 LLMs) and Section 5 (error-fixing results). However, we acknowledge that statistical tests, explicit baseline comparisons, and exclusion criteria were insufficiently highlighted. The revised version adds Wilcoxon signed-rank tests with p-values for all pairwise model comparisons, a new subsection (4.4) comparing MultiFileTest results against single-file baselines (HumanEval, MBPP), and an explicit list of exclusion criteria (API rate limits, context-length violations) in Section 4.1. These changes improve verifiability without altering the reported findings. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and LLM evaluation

The paper constructs MultiFileTest by selecting 20 projects per language and measures LLM performance, error types, and error-fixing outcomes directly against that fixed benchmark. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear; the central claims rest on observed pass rates and error counts rather than any derivation that reduces to its own inputs by construction. The representativeness concern raised by the skeptic is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the chosen projects adequately represent real multi-file codebases and that the identified error categories (executability, cascade) are the primary failure modes.

axioms (1)

domain assumption: The 20 moderate-sized high-quality projects per language are representative of practical multi-file codebases. Invoked to support the claim that MultiFileTest highlights real difficulty for LLMs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

We propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized and high-quality projects per language.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Error analyses … show that even frontier LLMs … have significant basic yet critical errors, including compilation and cascade errors.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mutation-Guided Unit Test Generation with a Large Language Model
cs.SE 2025-06 conditional novelty 6.0

MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2] Saranya Alagarsamy, Chakkrit Tantithamthavorn, Chetan Arora, and Aldeida Aleti. 2024. Enhancing large language models for text-to-testcase generation. arXiv preprint arXiv:2402.11910.

[3] M Moein Almasi, Hadi Hemmati, Gordon Fraser, Andrea Arcuri, and Janis Benefelds. 2017. An industrial evaluation of unit test generation: Finding real faults in a financial application. In 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pages 263-272. IEEE.

[4] AI Anthropic. 2024. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 3:6.

[5] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[7] Ermira Daka and Gordon Fraser. 2014. A survey on unit testing practices and problems. In 2014 IEEE 25th International Symposium on Software Reliability Engineering, pages 201-211. IEEE.

[8] Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. 2024. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468.

[9] Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv e-prints, pages arXiv--2308.

[10] Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 416-419.

[11] Giovanni Grano, Fabio Palomba, Dario Di Nucci, Andrea De Lucia, and Harald C Gall. 2019. Scented since the beginning: On the diffuseness of test smells in automatically generated test code. Journal of Systems and Software, 156:312-327.

[12] Giovanni Grano, Simone Scalabrino, Harald C Gall, and Rocco Oliveto. 2018. An empirical investigation on the readability of manual and generated test cases. In Proceedings of the 26th Conference on Program Comprehension, pages 348-351.

[13] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. Deepseek-coder: When the large language model meets programming--the rise of code intelligence. arXiv e-prints, pages arXiv--2401.

[14] Mark Harman and Phil McMinn. 2009. A theoretical and empirical study of search-based testing: Local, global, and hybrid search. IEEE Transactions on Software Engineering, 36(2):226-247.

[15] Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. 2024a. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752.

[16] Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024b. R2e: Turning any github repository into a programming agent environment. In Forty-first International Conference on Machine Learning.

[17] Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. 2024. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604.

[18] Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 14-26. IEEE.

[19] Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pages 168-172.

[20] Niels Mündler, Mark Niklas Mueller, Jingxuan He, and Martin Vechev. 2024. Swt-bench: Testing and validating real-world bug-fixes with code agents. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[21] Carlos Pacheco, Shuvendu K Lahiri, Michael D Ernst, and Thomas Ball. 2007. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE'07), pages 75-84. IEEE.

[22] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

[23] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering.

[24] Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. 2024. Using large language models to generate junit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 313-322.

[25] CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A Choquette-Choo, Jingyue Shen, Joe Kelley, et al. 2024a. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409.

[26] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024b. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

[27] Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. 2024. Testeval: Benchmarking large language models for test case generation. arXiv preprint arXiv:2406.04531.

[28] Xusheng Xiao, Sihan Li, Tao Xie, and Nikolai Tillmann. 2013. Characteristic studies of loop problems for structural test generation via symbolic execution. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 246-256. IEEE.

[29] Zhuokui Xie, Yinghao Chen, Chen Zhi, Shuiguang Deng, and Jianwei Yin. 2023. Chatunitest: a chatgpt-based automated unit test generation tool. arXiv preprint arXiv:2305.04764.

[30] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931.
