Call-Chain-Aware LLM-Based Test Generation for Java Projects
Pith reviewed 2026-05-09 20:53 UTC · model grok-4.3
The pith
Static analysis of call chains and dependencies lets large language models generate unit tests with higher coverage than execution-path methods alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that modeling caller-callee relationships, object constructors, and third-party dependencies via static analysis, and feeding these contexts into prompts with iterative repair of failed generations, enables large language models to produce executable, semantically valid tests with higher line and branch coverage than approaches limited to execution-path information.
What carries the argument
Call-chain-aware prompt construction that uses static analysis to capture and insert caller-callee relationships, object constructors, and dependency contexts into the input for the language model.
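The mechanism above can be sketched as a small context-assembly step. This is an illustrative reconstruction, not CAT's actual implementation: the class name, method names, and the textual prompt layout are all assumptions; only the inputs (caller-callee relations, constructor signatures) come from the paper's description.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of call-chain-aware prompt construction.
// Assembles a textual context block for the LLM prompt from statically
// extracted caller-callee relations and constructor signatures.
public class CallChainContext {

    public static String buildPromptContext(
            String focalMethod,
            Map<String, List<String>> callees,   // method -> methods it calls
            Map<String, String> constructors) {  // class -> constructor signature
        StringBuilder sb = new StringBuilder();
        sb.append("Focal method: ").append(focalMethod).append('\n');
        sb.append("Call chain:\n");
        for (String callee : callees.getOrDefault(focalMethod, List.of())) {
            sb.append("  ").append(focalMethod).append(" -> ").append(callee).append('\n');
        }
        sb.append("Constructors needed:\n");
        constructors.forEach((cls, ctor) ->
                sb.append("  ").append(cls).append(": ").append(ctor).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        String ctx = buildPromptContext(
                "Order.total()",
                Map.of("Order.total()", List.of("LineItem.price()", "TaxPolicy.rate()")),
                Map.of("Order", "Order(List<LineItem> items, TaxPolicy policy)"));
        System.out.print(ctx);
    }
}
```

The point of the sketch is only that the extracted relations are flattened into prompt text the model can condition on; the real system presumably derives them from bytecode or source analysis rather than hand-built maps.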
If this is right
- Tests become more executable and cover a larger portion of the code base.
- Performance holds on projects released after the language model was trained.
- Removing the call-chain and dependency elements from the prompts reduces the observed coverage gains.
- The approach supports iterative fixing to correct generation failures.
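The iterative fixing named in the last point amounts to a budgeted repair loop. The sketch below is an assumption about its shape, not CAT's code: the functional interfaces stand in for a compile-and-run check (e.g. javac plus a test runner) and for the LLM repair call, and all names are illustrative.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of an iterative test-fixing loop.
public class IterativeFixer {

    // check: test source -> error message if compilation/execution fails, empty on success.
    // repair: error message -> revised test source (stands in for an LLM call).
    public static String repairLoop(
            String initialTest,
            Function<String, Optional<String>> check,
            Function<String, String> repair,
            int maxAttempts) {
        String test = initialTest;
        for (int i = 0; i < maxAttempts; i++) {
            Optional<String> error = check.apply(test);
            if (error.isEmpty()) {
                return test;                  // compiles and runs: accept this test
            }
            test = repair.apply(error.get()); // feed the error back for a fixed version
        }
        return test;                          // budget exhausted: return best effort
    }

    public static void main(String[] args) {
        // Toy stand-ins: the "compiler" rejects tests missing a JUnit import,
        // and the "repair" step prepends it.
        String fixed = repairLoop(
                "class FooTest { void t() {} }",
                src -> src.contains("import org.junit")
                        ? Optional.empty()
                        : Optional.of("error: cannot find symbol Test"),
                err -> "import org.junit.Test;\nclass FooTest { void t() {} }",
                3);
        System.out.println(fixed.startsWith("import org.junit")); // true
    }
}
```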
Where Pith is reading between the lines
- The same static-analysis enrichment could be applied to generate tests for other languages that support similar caller-callee extraction.
- The resulting tests could serve as a starting point for human developers to refine rather than begin from scratch.
- Extending the context extraction to include runtime traces might further close remaining coverage gaps.
Load-bearing premise
Static analysis can reliably identify call chains and dependencies that supply useful additional guidance to the language model beyond execution paths.
What would settle it
A direct comparison on the same projects in which tests generated from prompts without the extracted call-chain and dependency contexts achieve equal or higher coverage than those generated with the contexts.
Original abstract
Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller-callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CAT, a call-chain-aware LLM-based approach for generating unit tests in Java projects. It uses dedicated static analysis to extract caller-callee relationships, object constructors, and third-party dependencies, incorporating these contexts into prompts along with iterative test fixing. On Defects4J, CAT reports 18.04% higher line coverage and 21.74% higher branch coverage than the state-of-the-art PANTA baseline; it also outperforms on four post-cutoff real-world GitHub projects. An ablation study is included to show the contribution of the call-chain and dependency contexts.
Significance. If the static analysis reliably extracts accurate and useful call-chain contexts that drive the observed coverage gains (rather than confounding factors), this work would meaningfully advance project-level test generation by addressing inter-class dependencies that execution-path-only methods miss. The evaluation on post-cutoff projects is a strength that reduces data-leakage risks. However, the central attribution of gains to call-chain awareness rests on an unvalidated assumption about static-analysis quality, which limits the strength of the contribution.
Major comments (3)
- [Section 5.3] Ablation study (Section 5.3): The ablation removes call-chain/dependency contexts and reports a performance drop, but provides no quantitative validation (e.g., precision/recall against ground-truth call graphs on a sample of methods, or manual inspection of extracted contexts) that the static analysis produces correct or complete information. Without this, the drop cannot be confidently attributed to the absence of accurate call-chain data rather than prompt-length differences or other uncontrolled variables.
- [Section 4] Evaluation setup (Section 4): The comparison to PANTA does not state whether prompt length, number of LLM calls, or the iterative fixing budget are held constant across CAT and the baseline. If CAT's prompts are systematically longer or receive more fixing iterations due to the added contexts, the reported 18-22% coverage improvements cannot be isolated to call-chain awareness.
- [Section 3.2] Static analysis description (Section 3.2): The paper describes modeling of caller-callee relationships and third-party dependencies but does not discuss handling of known Java static-analysis challenges such as virtual dispatch, reflection, or generics. If these cases frequently produce incomplete or incorrect contexts, the approach's reliability on real-world projects is overstated.
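Virtual dispatch, the first challenge named above, is typically approximated with class hierarchy analysis (CHA): a virtual call may target the declared method in the receiver's class or any override in its subclasses. The toy resolver below illustrates that over-approximation; it is not the paper's analysis, and real tools (e.g. SootUp) implement this over bytecode rather than string maps.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy class-hierarchy analysis (CHA) for virtual dispatch. Illustrative only.
public class ChaResolver {

    // subclasses: class -> direct subclasses; declares: class -> methods it declares.
    public static Set<String> resolveVirtualCall(
            String receiverType, String method,
            Map<String, List<String>> subclasses,
            Map<String, Set<String>> declares) {
        Set<String> targets = new TreeSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(receiverType);
        while (!work.isEmpty()) {
            String cls = work.pop();
            if (declares.getOrDefault(cls, Set.of()).contains(method)) {
                targets.add(cls + "." + method);  // possible dispatch target
            }
            subclasses.getOrDefault(cls, List.of()).forEach(work::push);
        }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, List<String>> subclasses = Map.of("Shape", List.of("Circle", "Square"));
        Map<String, Set<String>> declares = Map.of(
                "Circle", Set.of("area()"),
                "Square", Set.of("area()"));
        // A call on a Shape receiver may dispatch to either override.
        System.out.println(resolveVirtualCall("Shape", "area()", subclasses, declares));
        // [Circle.area(), Square.area()]
    }
}
```

The referee's point is that this kind of approximation is sound but imprecise (it lists every subclass override as a target), and reflection and generics erode it further; the paper should say which trade-offs its analysis makes.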
Minor comments (2)
- [Table 2] Table 2: The caption does not clarify whether the reported coverage numbers are averages across multiple LLM runs or single-run results; variance or statistical significance tests should be reported to support the percentage improvements.
- [Section 2] Related work (Section 2): The discussion of PANTA and other LLM-based test generators could more explicitly contrast their prompt-construction strategies with CAT's static-analysis pipeline to highlight the novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important aspects of validating our static analysis and ensuring fair comparisons. We address each major comment point by point below. We will incorporate revisions to strengthen the manuscript's claims and transparency.
Point-by-point responses
Referee: [Section 5.3] Ablation study (Section 5.3): The ablation removes call-chain/dependency contexts and reports a performance drop, but provides no quantitative validation (e.g., precision/recall against ground-truth call graphs on a sample of methods, or manual inspection of extracted contexts) that the static analysis produces correct or complete information. Without this, the drop cannot be confidently attributed to the absence of accurate call-chain data rather than prompt-length differences or other uncontrolled variables.
Authors: We agree that the current ablation study would be strengthened by direct validation of the static analysis outputs. The ablation isolates the contribution of call-chain and dependency contexts by removing them while retaining the base prompt structure and fixing procedure. To address the concern, we will add a quantitative validation in the revised Section 5.3: a manual inspection of extracted contexts for 50 randomly sampled methods across Defects4J projects, reporting precision and recall against manually constructed ground-truth call graphs and dependency lists. This will confirm the reliability of the static analysis and better attribute the observed drops. revision: yes
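The validation the authors promise reduces to set comparison over call edges: precision is the fraction of extracted edges that appear in the ground truth, recall the fraction of ground-truth edges recovered. A minimal sketch, assuming an illustrative "A.m -> B.n" string encoding of edges (not anything specified in the paper):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of precision/recall of statically extracted call edges
// against a manually built ground-truth call graph.
public class CallGraphValidation {

    public static double precision(Set<String> extracted, Set<String> truth) {
        if (extracted.isEmpty()) return 1.0;
        Set<String> hit = new HashSet<>(extracted);
        hit.retainAll(truth);                         // true positives
        return (double) hit.size() / extracted.size();
    }

    public static double recall(Set<String> extracted, Set<String> truth) {
        if (truth.isEmpty()) return 1.0;
        Set<String> hit = new HashSet<>(truth);
        hit.retainAll(extracted);
        return (double) hit.size() / truth.size();
    }

    public static void main(String[] args) {
        Set<String> extracted = Set.of("A.m -> B.n", "A.m -> C.p");
        Set<String> truth = Set.of("A.m -> B.n", "A.m -> D.q");
        System.out.println(precision(extracted, truth)); // 0.5
        System.out.println(recall(extracted, truth));    // 0.5
    }
}
```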
Referee: [Section 4] Evaluation setup (Section 4): The comparison to PANTA does not state whether prompt length, number of LLM calls, or the iterative fixing budget are held constant across CAT and the baseline. If CAT's prompts are systematically longer or receive more fixing iterations due to the added contexts, the reported 18-22% coverage improvements cannot be isolated to call-chain awareness.
Authors: We thank the referee for this important clarification point. The experimental protocol holds the number of LLM calls per generation task and the maximum iterative fixing budget constant across CAT and PANTA. Prompt lengths inherently differ because CAT incorporates additional call-chain and dependency information extracted from static analysis. The ablation study further controls for context effects by using identical base prompts with and without the added information. We will revise Section 4 to explicitly document these controls and report average prompt lengths (in tokens) for both approaches to enable full assessment of fairness. revision: partial
Referee: [Section 3.2] Static analysis description (Section 3.2): The paper describes modeling of caller-callee relationships and third-party dependencies but does not discuss handling of known Java static-analysis challenges such as virtual dispatch, reflection, or generics. If these cases frequently produce incomplete or incorrect contexts, the approach's reliability on real-world projects is overstated.
Authors: We acknowledge that Section 3.2 does not discuss the handling of virtual dispatch, reflection, and generics, which are well-known limitations in Java static analysis. Our implementation relies on standard call-graph construction that approximates these cases (e.g., via class hierarchy analysis for dispatch and conservative handling of reflective calls). We will revise Section 3.2 to include a new subsection explicitly addressing these challenges, describing our approximations, and discussing their potential impact on context completeness and coverage gains on real-world projects. This will provide a more balanced evaluation of reliability. revision: yes
Circularity Check
No circularity: empirical evaluation is independent of inputs
Full rationale
The paper presents an empirical method (CAT) that augments LLM prompts with statically extracted call-chain and dependency contexts, then measures coverage gains on Defects4J and post-cutoff GitHub projects plus an ablation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claim rests on external benchmarks and ablation rather than reducing to its own definitions or prior author results by construction. The skeptic concern about static-analysis accuracy is a correctness/falsifiability issue, not circularity.