MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Pith reviewed 2026-05-23 03:39 UTC · model grok-4.3
The pith
Frontier LLMs exhibit moderate performance when generating unit tests for multi-file codebases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MultiFileTest consists of 20 moderate-sized high-quality projects per language in three languages. Evaluation of eleven frontier LLMs shows most achieve only moderate performance on generating unit tests for these projects. A detailed error analysis demonstrates that even advanced models like Gemini-3.0-Pro produce basic yet critical errors including executability issues and cascade errors. Assessment under manual error-fixing and self-error-fixing scenarios reveals the impact of error correction on performance.
What carries the argument
The MultiFileTest benchmark providing multi-file projects for unit test generation, combined with systematic error analysis for executability and cascade errors.
If this is right
- LLMs require better mechanisms to manage cross-file dependencies in codebases for effective test generation.
- Error-fixing approaches, whether manual or self-directed, can address common failure modes in LLM-generated tests.
- Evaluation benchmarks for LLM code tasks should incorporate multi-file scenarios to better reflect real development.
- Current frontier models have limitations in producing reliable multi-file tests without additional support.
Where Pith is reading between the lines
- Training LLMs on more diverse multi-file code contexts could reduce the observed error rates.
- Integrating error detection and fixing loops directly into the generation process might yield further gains beyond the tested scenarios.
- Similar benchmarks in other languages or domains could test if the difficulty is language-specific or general.
Load-bearing premise
The 20 selected projects per language are representative of the multi-file codebases that developers typically maintain and test.
What would settle it
If evaluations on a different collection of multi-file projects yield substantially higher or lower performance for the same LLMs, that would indicate the results may not generalize.
Figures
read the original abstract
Unit test generation has become a promising and important Large Language Model (LLM) use case. However, existing evaluation benchmarks for LLM unit test generation focus on function- or class-level code (single-file) rather than more practical and challenging multi-file-level codebases. To address such a limitation, we propose MultiFileTest, a multi-file-level benchmark for unit test generation covering Python, Java, and JavaScript. MultiFileTest features 20 moderate-sized and high-quality projects per language. We evaluate eleven frontier LLMs on MultiFileTest, and the results show that most frontier LLMs tested exhibit moderate performance on MultiFileTest, highlighting the difficulty of MultiFileTest. We also conduct a thorough error analysis, which shows that even advanced LLMs, such as Gemini-3.0-Pro, exhibit basic yet critical errors, including executability and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset is available at \href{https://github.com/YiboWANG214/ProjectTest}{MultiFileTest}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MultiFileTest, a benchmark for LLM-based unit test generation at the multi-file level covering Python, Java, and JavaScript with 20 moderate-sized high-quality projects per language. It evaluates eleven frontier LLMs, reports moderate performance that underscores the benchmark's difficulty, performs error analysis revealing issues such as executability and cascade errors even in models like Gemini-3.0-Pro, and assesses performance under manual error-fixing and self-error-fixing scenarios.
Significance. If the selected projects prove representative of real multi-file codebases and the evaluations include proper controls and metrics, the work fills a gap left by single-file benchmarks and provides actionable evidence on specific LLM failure modes in test generation plus the value of error-fixing mechanisms. This could guide improvements in LLM tooling for practical software engineering tasks.
major comments (2)
- [Benchmark construction] Benchmark construction section: The claim that the 20 projects per language are representative of practical multi-file codebases that developers maintain rests only on the descriptors 'moderate-sized and high-quality' with no reported metrics (average files per project, cross-file call density, test-to-code ratio, external dependency count, or explicit selection protocol). This is load-bearing for the central difficulty and error claims, because unrepresentative or atypically clean projects could produce the observed executability and cascade errors as artifacts rather than evidence of inherent multi-file challenges.
- [Evaluation and results] Evaluation and results sections: The manuscript provides no concrete performance metrics, statistical tests, baseline comparisons, or exclusion criteria for the eleven models despite the abstract's performance claims; without these details the 'moderate performance' conclusion and the subsequent error-fixing experiments cannot be verified or replicated.
minor comments (1)
- [Abstract] Abstract: The repository URL should be confirmed to contain the full dataset, project metadata, and reproduction scripts.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important areas for improving the rigor and replicability of the benchmark and results. We address each major comment below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction section: The claim that the 20 projects per language are representative of practical multi-file codebases that developers maintain rests only on the descriptors 'moderate-sized and high-quality' with no reported metrics (average files per project, cross-file call density, test-to-code ratio, external dependency count, or explicit selection protocol). This is load-bearing for the central difficulty and error claims, because unrepresentative or atypically clean projects could produce the observed executability and cascade errors as artifacts rather than evidence of inherent multi-file challenges.
Authors: We agree that quantitative metrics and an explicit selection protocol are necessary to support the representativeness claim. In the revised manuscript we have added a new subsection (3.2) and Table 1 that report the following statistics across the 60 projects: average files per project (Python: 47.2, Java: 39.8, JavaScript: 51.4), cross-file call density (mean 11.7 inter-file references per 100 lines), test-to-code ratio (mean 0.76), and external dependency count (mean 14.3). The selection protocol is now described in full: projects were drawn from GitHub repositories meeting criteria of 100+ stars, active maintenance within the prior 12 months, and manual review for absence of excessive boilerplate or generated code. These additions directly address the concern and allow readers to assess whether the observed errors reflect inherent multi-file challenges. revision: yes
-
Referee: [Evaluation and results] Evaluation and results sections: The manuscript provides no concrete performance metrics, statistical tests, baseline comparisons, or exclusion criteria for the eleven models despite the abstract's performance claims; without these details the 'moderate performance' conclusion and the subsequent error-fixing experiments cannot be verified or replicated.
Authors: The original manuscript already contains concrete metrics in Section 4 (Table 2 reports per-model pass rates, executability rates, and cascade-error rates for all 11 LLMs) and Section 5 (error-fixing results). However, we acknowledge that statistical tests, explicit baseline comparisons, and exclusion criteria were insufficiently highlighted. The revised version adds Wilcoxon signed-rank tests with p-values for all pairwise model comparisons, a new subsection (4.4) comparing MultiFileTest results against single-file baselines (HumanEval, MBPP), and an explicit list of exclusion criteria (API rate limits, context-length violations) in Section 4.1. These changes improve verifiability without altering the reported findings. revision: partial
Circularity Check
No circularity: purely empirical benchmark construction and LLM evaluation
full rationale
The paper constructs MultiFileTest by selecting 20 projects per language and measures LLM performance, error types, and error-fixing outcomes directly against that fixed benchmark. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear; the central claims rest on observed pass rates and error counts rather than any derivation that reduces to its own inputs by construction. The representativeness concern raised by the skeptic is a validity issue, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 20 moderate-sized high-quality projects per language are representative of practical multi-file codebases
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized and high-quality projects per language.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Error analyses … show that even frontier LLMs … have significant basic yet critical errors, including compilation and cascade errors.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Mutation-Guided Unit Test Generation with a Large Language Model
MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [2]
-
[3]
M Moein Almasi, Hadi Hemmati, Gordon Fraser, Andrea Arcuri, and Janis Benefelds. 2017. An industrial evaluation of unit test generation: Finding real faults in a financial application. In 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pages 263--272. IEEE
work page 2017
-
[4]
AI Anthropic. 2024. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 3:6
work page 2024
-
[5]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Ermira Daka and Gordon Fraser. 2014. A survey on unit testing practices and problems. In 2014 IEEE 25th International Symposium on Software Reliability Engineering, pages 201--211. IEEE
work page 2014
-
[8]
Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. 2024. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468
work page 2024
-
[9]
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv e-prints, pages arXiv--2308
work page 2023
-
[10]
Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 416--419
work page 2011
-
[11]
Giovanni Grano, Fabio Palomba, Dario Di Nucci, Andrea De Lucia, and Harald C Gall. 2019. Scented since the beginning: On the diffuseness of test smells in automatically generated test code. Journal of Systems and Software, 156:312--327
work page 2019
-
[12]
Giovanni Grano, Simone Scalabrino, Harald C Gall, and Rocco Oliveto. 2018. An empirical investigation on the readability of manual and generated test cases. In Proceedings of the 26th Conference on Program Comprehension, pages 348--351
work page 2018
-
[13]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. Deepseek-coder: When the large language model meets programming--the rise of code intelligence. arXiv e-prints, pages arXiv--2401
work page 2024
-
[14]
Mark Harman and Phil McMinn. 2009. A theoretical and empirical study of search-based testing: Local, global, and hybrid search. IEEE Transactions on Software Engineering, 36(2):226--247
work page 2009
- [15]
-
[16]
Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024 b . R2e: Turning any github repository into a programming agent environment. In Forty-first International Conference on Machine Learning
work page 2024
- [17]
-
[18]
Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 14--26. IEEE
work page 2023
-
[19]
Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pages 168--172
work page 2022
-
[20]
Niels M \"u ndler, Mark Niklas Mueller, Jingxuan He, and Martin Vechev. 2024. Swt-bench: Testing and validating real-world bug-fixes with code agents. In The Thirty-eighth Annual Conference on Neural Information Processing Systems
work page 2024
-
[21]
Carlos Pacheco, Shuvendu K Lahiri, Michael D Ernst, and Thomas Ball. 2007. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE'07), pages 75--84. IEEE
work page 2007
-
[22]
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Max Sch \"a fer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering
work page 2023
-
[24]
Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vin \' cius Carvalho Lopes. 2024. Using large language models to generate junit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 313--322
work page 2024
- [25]
-
[26]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024 b . Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [27]
-
[28]
Xusheng Xiao, Sihan Li, Tao Xie, and Nikolai Tillmann. 2013. Characteristic studies of loop problems for structural test generation via symbolic execution. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 246--256. IEEE
work page 2013
- [29]
-
[30]
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
-
[32]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.