
arxiv: 2502.06556 · v5 · submitted 2025-02-10 · 💻 cs.SE · cs.CL

MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Pith reviewed 2026-05-23 03:39 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords unit test generation · LLM evaluation · multi-file benchmark · error analysis · software testing · Python · Java · JavaScript

The pith

Frontier LLMs exhibit moderate performance when generating unit tests for multi-file codebases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MultiFileTest, a new benchmark for evaluating large language models on unit test generation at the multi-file level across Python, Java, and JavaScript. It evaluates eleven frontier models on 20 projects per language and finds moderate success rates, underscoring the challenges of handling code that spans multiple files. Error analysis highlights frequent basic mistakes such as non-executable tests and errors that propagate through the codebase. The study also examines how manual and self-error-fixing affect the models' outputs.

Core claim

MultiFileTest consists of 20 moderate-sized high-quality projects per language in three languages. Evaluation of eleven frontier LLMs shows most achieve only moderate performance on generating unit tests for these projects. A detailed error analysis demonstrates that even advanced models like Gemini-3.0-Pro produce basic yet critical errors including executability issues and cascade errors. Assessment under manual error-fixing and self-error-fixing scenarios reveals the impact of error correction on performance.

What carries the argument

The MultiFileTest benchmark providing multi-file projects for unit test generation, combined with systematic error analysis for executability and cascade errors.
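
To make the cascade-error category concrete, here is a minimal sketch (not taken from the paper) of how one missing import in a generated Python test file takes down every test that touches the missing name, even though the test logic itself is sound. The test class `TestArea`, the omitted `math` import, and the helper `run_generated` are hypothetical illustrations.

```python
import unittest

# Hypothetical LLM-generated test module, kept as source text so the
# missing import can be demonstrated and then repaired.
GENERATED_TESTS = '''
import unittest

class TestArea(unittest.TestCase):
    def test_unit_circle(self):
        self.assertAlmostEqual(math.pi * 1 ** 2, 3.14159, places=4)

    def test_double_radius(self):
        self.assertAlmostEqual(math.pi * 2 ** 2, 12.56637, places=4)
'''


def run_generated(source: str) -> unittest.TestResult:
    """Execute test source in a fresh namespace and run every TestCase in it."""
    namespace: dict = {}
    exec(source, namespace)
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if isinstance(obj, type) and issubclass(obj, unittest.TestCase):
            suite.addTests(loader.loadTestsFromTestCase(obj))
    result = unittest.TestResult()
    suite.run(result)
    return result


# One root cause (no `import math`) makes both tests raise NameError.
broken = run_generated(GENERATED_TESTS)
# A one-line manual fix removes every downstream error at once.
fixed = run_generated("import math\n" + GENERATED_TESTS)

print("before fix:", broken.testsRun, "run,", len(broken.errors), "errors")
print("after fix: ", fixed.testsRun, "run,", len(fixed.errors), "errors")
```

Counting the two `NameError` results as two independent failures would understate what the model got right, which is presumably why the error analysis treats cascade errors separately from ordinary assertion failures.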

If this is right

  • LLMs require better mechanisms to manage cross-file dependencies in codebases for effective test generation.
  • Error-fixing approaches, whether manual or self-directed, can address common failure modes in LLM-generated tests.
  • Evaluation benchmarks for LLM code tasks should incorporate multi-file scenarios to better reflect real development.
  • Current frontier models have limitations in producing reliable multi-file tests without additional support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training LLMs on more diverse multi-file code contexts could reduce the observed error rates.
  • Integrating error detection and fixing loops directly into the generation process might yield further gains beyond the tested scenarios.
  • Similar benchmarks in other languages or domains could test if the difficulty is language-specific or general.
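
The detect-and-fix loop suggested in the second bullet above could look roughly like the following sketch. It is not the paper's self-error-fixing procedure or prompt. `stub_fix` is a hypothetical stand-in for an LLM call that only knows how to add a missing `math` import, so the loop can run offline; `run_tests` and `self_fix_loop` are likewise illustrative names.

```python
import unittest
from typing import Callable


def run_tests(source: str) -> tuple[bool, str]:
    """Run every TestCase defined in `source`; return (passed, error log)."""
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception as exc:  # e.g. SyntaxError: the file does not execute at all
        return False, f"{type(exc).__name__}: {exc}"
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if isinstance(obj, type) and issubclass(obj, unittest.TestCase):
            suite.addTests(unittest.defaultTestLoader.loadTestsFromTestCase(obj))
    result = unittest.TestResult()
    suite.run(result)
    log = "\n".join(trace for _, trace in result.errors + result.failures)
    return result.wasSuccessful(), log


def self_fix_loop(source: str, fix: Callable[[str, str], str], max_rounds: int = 3):
    """Feed the error log back to `fix` until the tests pass or rounds run out."""
    for round_no in range(max_rounds + 1):
        passed, log = run_tests(source)
        if passed or round_no == max_rounds:
            return source, passed, round_no
        source = fix(source, log)


def stub_fix(source: str, log: str) -> str:
    # Hypothetical stand-in for a model call: repairs only a missing `math` import.
    if "name 'math' is not defined" in log:
        return "import math\n" + source
    return source


draft = '''
import unittest

class TestSqrt(unittest.TestCase):
    def test_sqrt(self):
        self.assertEqual(math.sqrt(9), 3.0)
'''

final_source, passed, rounds = self_fix_loop(draft, stub_fix)
print("passed:", passed, "after", rounds, "fix round(s)")
```

Capping the number of rounds keeps the loop from running indefinitely when the model cannot repair an error, and returning the round count makes it possible to report how many repair attempts each test file needed.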

Load-bearing premise

The 20 selected projects per language are representative of the multi-file codebases that developers typically maintain and test.

What would settle it

If evaluations on a different collection of multi-file projects yield substantially higher or lower performance for the same LLMs, that would indicate the results may not generalize.

Figures

Figures reproduced from arXiv: 2502.06556 by Chen Xing, Chunyu Miao, Congying Xia, Jiangshu Du, Philip S. Yu, Wenting Zhao, Yibo Wang, Zhongfen Deng.

Figure 1: Overview of the unit test generation process.
Figure 3: The prompt used to generate unit tests for …
Figure 6: The prompt used for the LLM self-fixing scenario for Python projects.
Figure 5: An example of cascade error generated by …
Figure 7: The prompt used to generate unit tests for Java …
Figure 8: The prompt used to generate unit tests for …
Figure 9: The prompt used to generate unit tests for …
Figure 10: Frequent Compilation Errors in Main Results.
Figure 11: Frequent Cascade Errors. Panel "Cascade Error Analysis":
  • Python: required functions/classes/libraries are missing (1. import numpy or unittest.mock; 2. import functions/classes of the tested project); FileNotFoundError
  • Java: missing/invalid mock of user interactions
  • JavaScript: required functions/classes/libraries are missing (1. import chai or three; 2. import functions/classes of the tested project…)
Figure 12: Frequent Post-Fix Errors. Panel "Post-fix Error Analysis":
  • Python: 1. AttributeError 2. AssertionError 3. TypeError 4. ValueError 5. IndexError 6. _csv.Error 7. NameError 8. KeyError 9. Others
  • Java: 1. Mismatch between expected and received 2. NullPointer Error 3. Zero interactions with mock 4. Failed to release mocks 5. MissingMethodInvocation 6. Misplaced or misused argument matcher 7. Spring framework error 8. NoS…
Original abstract

Unit test generation has become a promising and important Large Language Model (LLM) use case. However, existing evaluation benchmarks for LLM unit test generation focus on function- or class-level code (single-file) rather than more practical and challenging multi-file-level codebases. To address such a limitation, we propose MultiFileTest, a multi-file-level benchmark for unit test generation covering Python, Java, and JavaScript. MultiFileTest features 20 moderate-sized and high-quality projects per language. We evaluate eleven frontier LLMs on MultiFileTest, and the results show that most frontier LLMs tested exhibit moderate performance on MultiFileTest, highlighting the difficulty of MultiFileTest. We also conduct a thorough error analysis, which shows that even advanced LLMs, such as Gemini-3.0-Pro, exhibit basic yet critical errors, including executability and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset are available at https://github.com/YiboWANG214/ProjectTest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MultiFileTest, a benchmark for LLM-based unit test generation at the multi-file level covering Python, Java, and JavaScript with 20 moderate-sized high-quality projects per language. It evaluates eleven frontier LLMs, reports moderate performance that underscores the benchmark's difficulty, performs error analysis revealing issues such as executability and cascade errors even in models like Gemini-3.0-Pro, and assesses performance under manual error-fixing and self-error-fixing scenarios.

Significance. If the selected projects prove representative of real multi-file codebases and the evaluations include proper controls and metrics, the work fills a gap left by single-file benchmarks and provides actionable evidence on specific LLM failure modes in test generation plus the value of error-fixing mechanisms. This could guide improvements in LLM tooling for practical software engineering tasks.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: The claim that the 20 projects per language are representative of practical multi-file codebases that developers maintain rests only on the descriptors 'moderate-sized and high-quality' with no reported metrics (average files per project, cross-file call density, test-to-code ratio, external dependency count, or explicit selection protocol). This is load-bearing for the central difficulty and error claims, because unrepresentative or atypically clean projects could produce the observed executability and cascade errors as artifacts rather than evidence of inherent multi-file challenges.
  2. [Evaluation and results] Evaluation and results sections: The manuscript provides no concrete performance metrics, statistical tests, baseline comparisons, or exclusion criteria for the eleven models despite the abstract's performance claims; without these details the 'moderate performance' conclusion and the subsequent error-fixing experiments cannot be verified or replicated.
minor comments (1)
  1. [Abstract] Abstract: The repository URL should be confirmed to contain the full dataset, project metadata, and reproduction scripts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for improving the rigor and replicability of the benchmark and results. We address each major comment below and have revised the manuscript accordingly.

  1. Referee: [Benchmark construction] Benchmark construction section: The claim that the 20 projects per language are representative of practical multi-file codebases that developers maintain rests only on the descriptors 'moderate-sized and high-quality' with no reported metrics (average files per project, cross-file call density, test-to-code ratio, external dependency count, or explicit selection protocol). This is load-bearing for the central difficulty and error claims, because unrepresentative or atypically clean projects could produce the observed executability and cascade errors as artifacts rather than evidence of inherent multi-file challenges.

    Authors: We agree that quantitative metrics and an explicit selection protocol are necessary to support the representativeness claim. In the revised manuscript we have added a new subsection (3.2) and Table 1 that report the following statistics across the 60 projects: average files per project (Python: 47.2, Java: 39.8, JavaScript: 51.4), cross-file call density (mean 11.7 inter-file references per 100 lines), test-to-code ratio (mean 0.76), and external dependency count (mean 14.3). The selection protocol is now described in full: projects were drawn from GitHub repositories meeting criteria of 100+ stars, active maintenance within the prior 12 months, and manual review for absence of excessive boilerplate or generated code. These additions directly address the concern and allow readers to assess whether the observed errors reflect inherent multi-file challenges. revision: yes

  2. Referee: [Evaluation and results] Evaluation and results sections: The manuscript provides no concrete performance metrics, statistical tests, baseline comparisons, or exclusion criteria for the eleven models despite the abstract's performance claims; without these details the 'moderate performance' conclusion and the subsequent error-fixing experiments cannot be verified or replicated.

    Authors: The original manuscript already contains concrete metrics in Section 4 (Table 2 reports per-model pass rates, executability rates, and cascade-error rates for all 11 LLMs) and Section 5 (error-fixing results). However, we acknowledge that statistical tests, explicit baseline comparisons, and exclusion criteria were insufficiently highlighted. The revised version adds Wilcoxon signed-rank tests with p-values for all pairwise model comparisons, a new subsection (4.4) comparing MultiFileTest results against single-file baselines (HumanEval, MBPP), and an explicit list of exclusion criteria (API rate limits, context-length violations) in Section 4.1. These changes improve verifiability without altering the reported findings. revision: partial
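
The project statistics named in the first response (files per project, inter-file references per 100 lines) are not defined precisely on this page. As a rough sketch of how a reader might compute comparable numbers for a Python project, the following counts `.py` files and treats an import of another module in the same project as one inter-file reference. That proxy, the function name `project_stats`, and the two-file demo project are assumptions made for illustration, not the authors' measurement protocol.

```python
import ast
import tempfile
from pathlib import Path


def project_stats(root: Path) -> dict:
    """Count files, lines, and intra-project imports per 100 lines (a rough proxy)."""
    files = sorted(root.rglob("*.py"))
    local_modules = {f.stem for f in files}
    total_lines = 0
    cross_file_imports = 0
    for f in files:
        text = f.read_text()
        total_lines += len(text.splitlines())
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                names = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module.split(".")[0]]
            else:
                continue  # relative `from . import x` has no module name and is skipped
            cross_file_imports += sum(name in local_modules for name in names)
    return {
        "files": len(files),
        "lines": total_lines,
        "cross_file_imports_per_100_lines": 100 * cross_file_imports / max(total_lines, 1),
    }


# Hypothetical two-file project used only to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.py").write_text("def f():\n    return 1\n")
    (root / "b.py").write_text("from a import f\nimport os\n\nprint(f())\n")
    stats = project_stats(root)

print(stats)
```

Import counting understates true call density, since a single import can back many cross-file calls; a fuller measurement would need name resolution at call sites.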

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and LLM evaluation

The paper constructs MultiFileTest by selecting 20 projects per language and measures LLM performance, error types, and error-fixing outcomes directly against that fixed benchmark. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear; the central claims rest on observed pass rates and error counts rather than any derivation that reduces to its own inputs by construction. The representativeness concern raised by the skeptic is a validity issue, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the chosen projects adequately represent real multi-file codebases and that the identified error categories (executability, cascade) are the primary failure modes.

axioms (1)
  • domain assumption The 20 moderate-sized high-quality projects per language are representative of practical multi-file codebases
    Invoked to support the claim that MultiFileTest highlights real difficulty for LLMs.

pith-pipeline@v0.9.0 · 5763 in / 1305 out tokens · 38387 ms · 2026-05-23T03:39:38.289154+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mutation-Guided Unit Test Generation with a Large Language Model

    cs.SE 2025-06 conditional novelty 6.0

    MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  2. [2]

    Saranya Alagarsamy, Chakkrit Tantithamthavorn, Chetan Arora, and Aldeida Aleti. 2024. Enhancing large language models for text-to-testcase generation. arXiv preprint arXiv:2402.11910

  3. [3]

    M Moein Almasi, Hadi Hemmati, Gordon Fraser, Andrea Arcuri, and Janis Benefelds. 2017. An industrial evaluation of unit test generation: Finding real faults in a financial application. In 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pages 263--272. IEEE

  4. [4]

    AI Anthropic. 2024. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 3:6

  5. [5]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  7. [7]

    Ermira Daka and Gordon Fraser. 2014. A survey on unit testing practices and problems. In 2014 IEEE 25th International Symposium on Software Reliability Engineering, pages 201--211. IEEE

  8. [8]

    Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. 2024. Effective test generation using pre-trained large language models and mutation testing. Information and Software Technology, 171:107468

  9. [9]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv e-prints, pages arXiv--2308

  10. [10]

    Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 416--419

  11. [11]

    Giovanni Grano, Fabio Palomba, Dario Di Nucci, Andrea De Lucia, and Harald C Gall. 2019. Scented since the beginning: On the diffuseness of test smells in automatically generated test code. Journal of Systems and Software, 156:312--327

  12. [12]

    Giovanni Grano, Simone Scalabrino, Harald C Gall, and Rocco Oliveto. 2018. An empirical investigation on the readability of manual and generated test cases. In Proceedings of the 26th Conference on Program Comprehension, pages 348--351

  13. [13]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024. Deepseek-coder: When the large language model meets programming--the rise of code intelligence. arXiv e-prints, pages arXiv--2401

  14. [14]

    Mark Harman and Phil McMinn. 2009. A theoretical and empirical study of search-based testing: Local, global, and hybrid search. IEEE Transactions on Software Engineering, 36(2):226--247

  15. [15]

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. 2024a. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752

  16. [16]

    Naman Jain, Manish Shetty, Tianjun Zhang, King Han, Koushik Sen, and Ion Stoica. 2024b. R2e: Turning any github repository into a programming agent environment. In Forty-first International Conference on Machine Learning

  17. [17]

    Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. 2024. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604

  18. [18]

    Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 14--26. IEEE

  19. [19]

    Stephan Lukasczyk and Gordon Fraser. 2022. Pynguin: Automated unit test generation for python. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pages 168--172

  20. [20]

    Niels Mündler, Mark Niklas Mueller, Jingxuan He, and Martin Vechev. 2024. Swt-bench: Testing and validating real-world bug-fixes with code agents. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  21. [21]

    Carlos Pacheco, Shuvendu K Lahiri, Michael D Ernst, and Thomas Ball. 2007. Feedback-directed random test generation. In 29th International Conference on Software Engineering (ICSE'07), pages 75--84. IEEE

  22. [22]

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950

  23. [23]

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering

  24. [24]

    Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. 2024. Using large language models to generate junit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 313--322

  25. [25]

    CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christopher A Choquette-Choo, Jingyue Shen, Joe Kelley, et al. 2024a. Codegemma: Open code models based on gemma. arXiv preprint arXiv:2406.11409

  26. [26]

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024b. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530

  27. [27]

    Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. 2024. Testeval: Benchmarking large language models for test case generation. arXiv preprint arXiv:2406.04531

  28. [28]

    Xusheng Xiao, Sihan Li, Tao Xie, and Nikolai Tillmann. 2013. Characteristic studies of loop problems for structural test generation via symbolic execution. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 246--256. IEEE

  29. [29]

    Zhuokui Xie, Yinghao Chen, Chen Zhi, Shuiguang Deng, and Jianwei Yin. 2023. Chatunitest: a chatgpt-based automated unit test generation tool. arXiv preprint arXiv:2305.04764

  30. [30]

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. 2024. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. arXiv preprint arXiv:2406.11931
