pith. machine review for the scientific record. sign in

arxiv: 2604.22046 · v1 · submitted 2026-04-23 · 💻 cs.SE · cs.AI

Recognition: unknown

Call-Chain-Aware LLM-Based Test Generation for Java Projects

Authors on Pith no claims yet

Pith reviewed 2026-05-09 20:53 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM-based test generationcall-chain analysisunit test generationstatic analysisJava software testingcode coveragesoftware dependency modeling
0
0 comments X

The pith

Static analysis of call chains and dependencies lets large language models generate unit tests with higher coverage than execution-path methods alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models show promise for creating unit tests but often fail to handle the full web of interactions in real codebases when prompts contain only execution paths. This paper establishes that systematically extracting caller-callee links, constructor details, and third-party dependencies through static analysis and inserting them into prompts produces tests that run and cover more code. The improvement matters because it addresses a practical bottleneck in automated testing of complex systems where manual effort remains high. If correct, the result follows that structural program information can be turned into prompt content to make model outputs more reliable and complete. The work further shows through comparison and removal of components that the added contexts drive the gains.

Core claim

The central claim is that modeling caller-callee relationships, object constructors, and third-party dependencies via static analysis, then feeding these contexts into prompts along with iterative repair for failed generations, enables large language models to produce executable and semantically valid tests that achieve higher line and branch coverage than approaches limited to execution-path information.

What carries the argument

Call-chain-aware prompt construction that uses static analysis to capture and insert caller-callee relationships, object constructors, and dependency contexts into the input for the language model.

If this is right

  • Tests become more executable and cover a larger portion of the code base.
  • Performance holds on projects released after the language model was trained.
  • Removing the call-chain and dependency elements from the prompts reduces the observed coverage gains.
  • The approach supports iterative fixing to correct generation failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same static-analysis enrichment could be applied to generate tests for other languages that support similar caller-callee extraction.
  • The resulting tests could serve as a starting point for human developers to refine rather than begin from scratch.
  • Extending the context extraction to include runtime traces might further close remaining coverage gaps.

Load-bearing premise

Static analysis can reliably identify call chains and dependencies that supply useful additional guidance to the language model beyond execution paths.

What would settle it

A direct comparison on the same projects in which tests generated from prompts without the extracted call-chain and dependency contexts achieve equal or higher coverage than those generated with the contexts.

Figures

Figures reproduced from arXiv: 2604.22046 by Guancheng Wang, Kui Liu, Lionel C. Briand, Qinghua Xu, Zhaoqiang Guo.

Figure 1
Figure 1. Figure 1: Overview of the CAT workflow 3.1 Overview CAT consists of two phases: a generation phase and a fixing phase. It can operate with or without existing tests. When no tests are available, CAT first creates a pseudo test file with placeholder test functions, ensuring that newly generated tests can be compiled and executed within a valid test harness. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The system prompt used in CAT 3.4 Fixing Phase After each generation iteration, once the LLM produces candidate tests, we compile and execute all generated tests. Failed tests are collected and forwarded to the fixing phase, where execution feedback is leveraged to repair non-compilable or failing tests. Specifically, we incorporate failed tests together with their corresponding error messages into the fix… view at source ↗
Figure 4
Figure 4. Figure 4: The fixing prompt explicitly instructs the model to focus solely on repairing issues in the test code while preserving the semantics of the production code. Typical repair targets include compilation errors (e.g., missing imports and incorrect API usage leading to type mismatches) and runtime failures (e.g., invalid assertions and incorrect object construction), which are among the most common failure mode… view at source ↗
Figure 3
Figure 3. Figure 3: Main excerpt of the user prompt used in CAT and concise failure reasons. In addition, the provided source files of related classes are restricted to the top-𝑑𝑚𝑎𝑥 most frequently occurring classes in the extracted call-chain context, consistent with Section 3.3. Manuscript submitted to ACM [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Fixing prompt used for LLM-based test repair with source-code grounding [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
read the original abstract

Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller--callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents CAT, a call-chain-aware LLM-based approach for generating unit tests in Java projects. It uses dedicated static analysis to extract caller-callee relationships, object constructors, and third-party dependencies, incorporating these contexts into prompts along with iterative test fixing. On Defects4J, CAT reports 18.04% higher line coverage and 21.74% higher branch coverage than the state-of-the-art PANTA baseline; it also outperforms on four post-cutoff real-world GitHub projects. An ablation study is included to show the contribution of the call-chain and dependency contexts.

Significance. If the static analysis reliably extracts accurate and useful call-chain contexts that drive the observed coverage gains (rather than confounding factors), this work would meaningfully advance project-level test generation by addressing inter-class dependencies that execution-path-only methods miss. The evaluation on post-cutoff projects is a strength that reduces data-leakage risks. However, the central attribution of gains to call-chain awareness rests on an unvalidated assumption about static-analysis quality, which limits the strength of the contribution.

major comments (3)
  1. [Section 5.3] Ablation study (Section 5.3): The ablation removes call-chain/dependency contexts and reports a performance drop, but provides no quantitative validation (e.g., precision/recall against ground-truth call graphs on a sample of methods, or manual inspection of extracted contexts) that the static analysis produces correct or complete information. Without this, the drop cannot be confidently attributed to the absence of accurate call-chain data rather than prompt-length differences or other uncontrolled variables.
  2. [Section 4] Evaluation setup (Section 4): The comparison to PANTA does not state whether prompt length, number of LLM calls, or the iterative fixing budget are held constant across CAT and the baseline. If CAT's prompts are systematically longer or receive more fixing iterations due to the added contexts, the reported 18-22% coverage improvements cannot be isolated to call-chain awareness.
  3. [Section 3.2] Static analysis description (Section 3.2): The paper describes modeling of caller-callee relationships and third-party dependencies but does not discuss handling of known Java static-analysis challenges such as virtual dispatch, reflection, or generics. If these cases frequently produce incomplete or incorrect contexts, the approach's reliability on real-world projects is overstated.
minor comments (2)
  1. [Table 2] Table 2: The caption does not clarify whether the reported coverage numbers are averages across multiple LLM runs or single-run results; variance or statistical significance tests should be reported to support the percentage improvements.
  2. [Section 2] Related work (Section 2): The discussion of PANTA and other LLM-based test generators could more explicitly contrast their prompt-construction strategies with CAT's static-analysis pipeline to highlight the novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of validating our static analysis and ensuring fair comparisons. We address each major comment point by point below. We will incorporate revisions to strengthen the manuscript's claims and transparency.

read point-by-point responses
  1. Referee: [Section 5.3] Ablation study (Section 5.3): The ablation removes call-chain/dependency contexts and reports a performance drop, but provides no quantitative validation (e.g., precision/recall against ground-truth call graphs on a sample of methods, or manual inspection of extracted contexts) that the static analysis produces correct or complete information. Without this, the drop cannot be confidently attributed to the absence of accurate call-chain data rather than prompt-length differences or other uncontrolled variables.

    Authors: We agree that the current ablation study would be strengthened by direct validation of the static analysis outputs. The ablation isolates the contribution of call-chain and dependency contexts by removing them while retaining the base prompt structure and fixing procedure. To address the concern, we will add a quantitative validation in the revised Section 5.3: a manual inspection of extracted contexts for 50 randomly sampled methods across Defects4J projects, reporting precision and recall against manually constructed ground-truth call graphs and dependency lists. This will confirm the reliability of the static analysis and better attribute the observed drops. revision: yes

  2. Referee: [Section 4] Evaluation setup (Section 4): The comparison to PANTA does not state whether prompt length, number of LLM calls, or the iterative fixing budget are held constant across CAT and the baseline. If CAT's prompts are systematically longer or receive more fixing iterations due to the added contexts, the reported 18-22% coverage improvements cannot be isolated to call-chain awareness.

    Authors: We thank the referee for this important clarification point. The experimental protocol holds the number of LLM calls per generation task and the maximum iterative fixing budget constant across CAT and PANTA. Prompt lengths inherently differ because CAT incorporates additional call-chain and dependency information extracted from static analysis. The ablation study further controls for context effects by using identical base prompts with and without the added information. We will revise Section 4 to explicitly document these controls and report average prompt lengths (in tokens) for both approaches to enable full assessment of fairness. revision: partial

  3. Referee: [Section 3.2] Static analysis description (Section 3.2): The paper describes modeling of caller-callee relationships and third-party dependencies but does not discuss handling of known Java static-analysis challenges such as virtual dispatch, reflection, or generics. If these cases frequently produce incomplete or incorrect contexts, the approach's reliability on real-world projects is overstated.

    Authors: We acknowledge that Section 3.2 does not discuss the handling of virtual dispatch, reflection, and generics, which are well-known limitations in Java static analysis. Our implementation relies on standard call-graph construction that approximates these cases (e.g., via class hierarchy analysis for dispatch and conservative handling of reflective calls). We will revise Section 3.2 to include a new subsection explicitly addressing these challenges, describing our approximations, and discussing their potential impact on context completeness and coverage gains on real-world projects. This will provide a more balanced evaluation of reliability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation is independent of inputs

full rationale

The paper presents an empirical method (CAT) that augments LLM prompts with statically extracted call-chain and dependency contexts, then measures coverage gains on Defects4J and post-cutoff GitHub projects plus an ablation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claim rests on external benchmarks and ablation rather than reducing to its own definitions or prior author results by construction. The skeptic concern about static-analysis accuracy is a correctness/falsifiability issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical software engineering paper. The central claim rests on standard assumptions that static analysis can accurately model call chains and that LLMs can effectively use the resulting context in prompts. No new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5511 in / 1071 out tokens · 29595 ms · 2026-05-09T20:53:18.001385+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd international conference on software engineering. 1–10

  2. [2]

    David F Bacon and Peter F Sweeney. 1996. Fast static analysis of C++ virtual function calls. InProceedings of the 11th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications. 324–341

  3. [3]

    Binance. [n. d.]. Binance Java Connectors. https://github.com/binance/binance-connector-java. Accessed: 2026-04-12

  4. [4]

    Cristian Cadar, Daniel Dunbar, Dawson R Engler, et al. 2008. Klee: unassisted and automatic generation of high-coverage tests for complex systems programs.. InOSDI, Vol. 8. 209–224

  5. [5]

    Cristian Cadar and Koushik Sen. 2013. Symbolic execution for software testing: three decades later.Commun. ACM56, 2 (2013), 82–90

  6. [6]

    Xiang Cheng, Fan Sang, Yizhuo Zhai, Xiaokuan Zhang, and Taesoo Kim. 2025. Rug: Turbo llm for rust unit test generation. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, 634–634

  7. [7]

    Qwen Code. [n. d.]. Qwen Code Documentation. https://qwenlm.github.io/qwen-code-docs/en/users/overview/. Accessed: 2026-04-20

  8. [8]

    Jeffrey Dean, David Grove, and Craig Chambers. 1995. Optimization of object-oriented programs using static class hierarchy analysis. InEuropean conference on object-oriented programming. Springer, 77–101

  9. [9]

    Greta Dolcetti, Vincenzo Arceri, Eleonora Iotti, Sergio Maffeis, Agostino Cortesi, and Enea Zaffanella. 2026. Helping llms improve code generation using feedback from testing and static analysis.Discover Artificial Intelligence(2026)

  10. [10]

    Gordon Fraser and Andrea Arcuri. 2011. Evosuite: automatic test suite generation for object-oriented software. InProceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. 416–419

  11. [11]

    Joseph Gil and Keren Lenz. 2010. The use of overloading in Java programs. InEuropean Conference on Object-Oriented Programming. Springer, 529–551

  12. [12]

    Google. [n. d.]. Agent Development Kit (ADK) for Java. https://github.com/google/adk-java. Accessed: 2026-04-12

  13. [13]

    Sijia Gu, Noor Nashid, and Ali Mesbah. 2025. LLM Test Generation via Iterative Hybrid Program Analysis.arXiv preprint arXiv:2503.13580(2025)

  14. [14]

    Susan Horwitz, Thomas Reps, and David Binkley. 1990. Interprocedural slicing using dependence graphs.ACM Transactions on Programming Languages and Systems (TOPLAS)12, 1 (1990), 26–60

  15. [15]

    iFLYTEK Open Source. [n. d.]. Accelerate Agentic AI Strengthen Productivity Onspire Imagination. https://github.com/iflytek/astron-agent. Accessed: 2026-04-12

  16. [16]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974(2024)

  17. [17]

    JD Open Source. [n. d.]. JoyAgent-JDGenie. https://github.com/jd-opensource/joyagent-jdgenie. Accessed: 2026-04-12

  18. [18]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770(2023)

  19. [19]

    René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. InProceedings of the 2014 international symposium on software testing and analysis. 437–440

  20. [20]

    Mohammad Taha Khan, Mohamed Elhussiny, William Tobin, and Muhammad Ali Gulzar. 2024. An Empirical Evaluation of Method Signature Similarity in Java Codebases. InProceeding of the 2024 5th Asia Service Sciences and Software Engineering Conference. 35–42

  21. [21]

    Tri Le, Thien Tran, Duy Cao, Vy Le, Tien N Nguyen, and Vu Nguyen. 2024. Kat: Dependency-aware automated api testing with large language models. In2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 82–92. Manuscript submitted to ACM 26 Wang et al

  22. [22]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys55, 9 (2023), 1–35

  23. [23]

    Zifan Nan, Zhaoqiang Guo, Kui Liu, and Xin Xia. 2025. Test intention guided LLM-based unit test generation. In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 1026–1038

  24. [24]

    Ollama. 2025. Qwen3-Coder Model. https://ollama.com/library/qwen3-coder. Accessed: 2026-02-05

  25. [25]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.Advances in neural information processing systems35 (2022), 27730–27744

  26. [26]

    Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. InCompanion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. 815–816

  27. [27]

    Rangeet Pan, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. 2025. Aster: Natural and multi-language unit test generation with llms. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 413–424

  28. [28]

    PANTA TestAutomation. 2025. Panta: A Test Automation Framework. https://github.com/PANTA-TestAutomation/Panta. Accessed: 2026-02-05

  29. [29]

    Ravin Ravi, Dylan Bradshaw, Stefano Ruberto, Gunel Jahangirova, and Valerio Terragni. 2025. LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops. In2025 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 930–934

  30. [30]

    Sujoy Roy Chowdhury, Giriprasad Sridhara, AK Raghavan, Joy Bose, Sourav Mazumdar, Hamender Singh, Srinivasan Bajji Sugumaran, and Ricardo Britto. 2024. Static program analysis guided LLM based unit test generation. InProceedings of the 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD). 279–283

  31. [31]

    Gabriel Ryan, Siddhartha Jain, Mingyue Shang, Shiqi Wang, Xiaofei Ma, Murali Krishna Ramanathan, and Baishakhi Ray. 2024. Code-aware prompting: A study of coverage-guided test generation in regression setting using llm.Proceedings of the ACM on Software Engineering1, FSE (2024), 951–971

  32. [32]

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation.IEEE Transactions on Software Engineering50, 1 (2023), 85–105

  33. [33]

    Yannis Smaragdakis and George Balatsouras. 2015. Pointer analysis.Foundations and Trends in Programming Languages2, 1 (2015), 1–69

  34. [34]

    soot-oss. 2023. SootUp: A Redesign of the Soot Static Analysis Framework. https://github.com/soot-oss/SootUp. Accessed: 2026-02-05

  35. [35]

    Nikolai Tillmann and Jonathan De Halleux. 2008. Pex–white box test generation for. net. InInternational conference on tests and proofs. Springer, 134–153

  36. [36]

    Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation.ACM Transactions on Software Engineering and Methodology (TOSEM)28, 4 (2019), 1–29

  37. [37]

    András Vargha and Harold D Delaney. 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics25, 2 (2000), 101–132

  38. [38]

    Guancheng Wang, Qinghua Xu, Lionel C Briand, and Kui Liu. 2025. Mutation-Guided Unit Test Generation with a Large Language Model.arXiv preprint arXiv:2506.02954(2025)

  39. [39]

    Zejun Wang, Kaibo Liu, Ge Li, and Zhi Jin. 2024. Hits: High-coverage llm-based unit test generation via method slicing. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1258–1268

  40. [40]

    Yaoxuan Wu, Xiaojie Zhou, Ahmad Humayun, Muhammad Ali Gulzar, and Miryung Kim. 2025. Generating and understanding tests via path-aware symbolic execution with llms.arXiv preprint arXiv:2506.19287(2025)

  41. [41]

    Qinghua Xu, Guancheng Wang, Lionel Briand, and Kui Liu. 2025. Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation. arXiv preprint arXiv:2506.02943(2025)

  42. [42]

    Chen Yang, Junjie Chen, Bin Lin, Jianyi Zhou, and Ziqi Wang. 2024. Enhancing llm-based test generation for hard-to-cover branches via program analysis.arXiv e-prints(2024), arXiv–2404

  43. [43]

    Junwei Zhang, Xing Hu, Cuiyun Gao, Xin Xia, and Shanping Li. 2026. Enhancing Automated Unit Test Generation with Large Language Models: A Systematic Literature Review.ACM Transactions on Software Engineering and Methodology(2026)

  44. [44]

    Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, and Zhenyu Chen. 2024. Testbench: Evaluating class-level test case generation capability of large language models.arXiv preprint arXiv:2409.17561(2024)

  45. [45]

    Yuwei Zhang, Qingyuan Lu, Kai Liu, Wensheng Dou, Jiaxin Zhu, Li Qian, Chunxi Zhang, Zheng Lin, and Jun Wei. 2025. Citywalk: Enhancing llm-based c++ unit test generation via project-dependency awareness and language-specific knowledge.ACM Transactions on Software Engineering and Methodology(2025). Received 20 February 2007; revised 12 March 2009; accepted ...