pith. machine review for the scientific record.

arXiv: 2604.11270 · v2 · submitted 2026-04-13 · 💻 cs.SE

Recognition: unknown

Evaluating LLM Agents on Automated Software Analysis Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM agents · software analysis tools · benchmark · agent architecture · automated configuration · C/C++ and Java projects · tool setup

The pith

A custom LLM agent achieves 94% success in setting up software analysis tools on a benchmark of 35 tasks, far exceeding baseline agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AnalysisBench, a benchmark of 35 tool-project pairs across seven analysis tools and ten C/C++ and Java projects, each with a reference setup. It tests four agent architectures on four LLMs and finds that a custom agent reaches a 94% manually verified success rate, compared to 77% for the strongest baseline. The evaluation reveals that agent architecture is more important than the specific LLM used, and identifies common problems in other agents such as mixing different stages, failing to localize errors, and stopping too soon. Readers would care because this shows a path to automating the difficult task of configuring analysis tools for real projects, which currently requires significant manual effort.

Core claim

Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.
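
A quick arithmetic check of these headline numbers, as a sketch rather than a result from the paper: the 27/35 task count for ExecutionAgent is inferred from the reported 77%, and the exact effect-size measure behind Figure 2's "times more likely to succeed" framing is not specified in the excerpt, so both a risk ratio and an odds ratio are shown.

    # Sanity-check the reported success rates and the relative-success framing of Figure 2.
    # The 27/35 baseline count is an assumption inferred from the reported 77%,
    # not a number stated in the paper.
    tasks = 35
    analysis_agent_successes = 33            # given: 33/35 with Gemini-3-Flash
    execution_agent_successes = 27           # assumed from 77% of 35

    p_analysis = analysis_agent_successes / tasks      # ~0.943
    p_execution = execution_agent_successes / tasks    # ~0.771

    risk_ratio = p_analysis / p_execution               # ratio of success rates
    odds = lambda p: p / (1 - p)
    odds_ratio = odds(p_analysis) / odds(p_execution)   # alternative effect size

    print(f"AnalysisAgent  {p_analysis:.1%}")    # 94.3%
    print(f"ExecutionAgent {p_execution:.1%}")   # 77.1%
    print(f"risk ratio  {risk_ratio:.2f}x")      # ~1.22x
    print(f"odds ratio  {odds_ratio:.2f}x")      # ~4.89x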

What carries the argument

AnalysisBench, a benchmark of manually constructed reference setups for 35 tool-project pairs, used to measure agent success in installing, configuring, and running analysis tools to produce valid outputs.
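
To make concrete what a reference setup and a success check involve, here is a hypothetical sketch of one benchmark entry in Python. The schema, field names, and the non-empty-output check are illustrative assumptions, not AnalysisBench's actual format; in particular, the paper's verdicts come from manual verification against the reference setup, not from an automated check like this one.

    from dataclasses import dataclass, field
    import subprocess

    @dataclass
    class AnalysisTask:
        # One tool-project pair; field names are hypothetical, not the paper's schema.
        tool: str                      # e.g. "KLEE" (symbolic execution)
        project: str                   # e.g. "fastfetch" (a C project)
        language: str                  # "C/C++" or "Java"
        reference_setup: list = field(default_factory=list)  # hand-built shell commands
        expected_output: str = "report.txt"                   # artifact the tool must produce

    def run_agent_setup(commands, timeout_s=3600):
        """Execute an agent-proposed setup; stop at the first failing step."""
        for cmd in commands:
            proc = subprocess.run(cmd, shell=True, capture_output=True, timeout=timeout_s)
            if proc.returncode != 0:
                return False
        return True

    def output_looks_meaningful(path):
        """Cheap stand-in for manual verification: a non-empty analysis report.
        The benchmark's actual verdicts come from human inspection against the reference."""
        try:
            with open(path) as f:
                return bool(f.read().strip())
        except OSError:
            return False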

If this is right

  • Targeted agent designs can reliably automate software analysis tool deployment without expert intervention.
  • Limitations like stage mixing and premature termination can be mitigated through improved agent workflows.
  • Whole-program and symbolic analyses remain harder, suggesting a need for specialized agent strategies.
  • Java setups are tougher than C/C++, pointing to language-specific difficulties in agent performance.
  • LLM self-validation overestimates true success, requiring external verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results suggest that for configuration-heavy software engineering tasks, investing in agent structure may yield higher returns than scaling up model size alone.
  • The benchmark could be extended to other languages or analysis domains to test if the architecture advantage generalizes.
  • If adopted, such agents might make advanced program analysis tools more widely usable by non-experts in open source projects.
  • Future work might explore combining these agents with code generation for fixing setup errors dynamically.

Load-bearing premise

The manually constructed reference setups represent the definitive correct configurations for each tool-project pair, and human manual verification of agent outputs is objective and free of bias.

What would settle it

Re-running the agent evaluations with different human verifiers or independent reference setups and finding substantially lower success rates or high disagreement among verifiers.
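
If two or more independent verifiers re-judged the 35 tasks, their agreement could be summarized with an inter-rater statistic such as Cohen's kappa. A minimal sketch follows; the two verdict vectors are made up for illustration and do not correspond to any real re-verification.

    def cohen_kappa(rater_a, rater_b):
        """Cohen's kappa for two raters giving binary pass/fail verdicts per task."""
        assert len(rater_a) == len(rater_b) and rater_a
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        pa, pb = sum(rater_a) / n, sum(rater_b) / n
        expected = pa * pb + (1 - pa) * (1 - pb)   # chance agreement for binary labels
        return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

    # Hypothetical verdicts over the 35 tasks (True = setup judged successful).
    verifier_1 = [True] * 33 + [False] * 2
    verifier_2 = [True] * 31 + [False] * 4
    print(f"kappa = {cohen_kappa(verifier_1, verifier_2):.2f}")   # ~0.64 for these made-up verdicts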

Figures

Figures reproduced from arXiv: 2604.11270 by Cristian Cadar, Islem Bouzenia, Michael Pradel.

Figure 1. Stage 4 (analysis run) for KLEE on fastfetch.
Figure 2. How many times more likely AnalysisAgent is to succeed compared to baselines, per LLM backend (circles) and pooled.
Figure 3. Average verified success rate by analysis tool and …
Figure 4. Root-cause failure categories by agent (%).
Figure 5. Resource consumption by agent and LLM backend.
Original abstract

Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AnalysisBench, a benchmark of 35 manually constructed tool-project pairs spanning seven analysis tools and ten C/C++/Java projects. It evaluates four agent architectures (including a proposed AnalysisAgent) across four LLM backends on the task of installing, configuring, and executing software analysis tools to produce meaningful outputs. Key claims include AnalysisAgent reaching 94% manually verified success (33/35 tasks with Gemini-3-Flash) versus 77% for the strongest baseline, that agent architecture outweighs LLM choice, that whole-program/symbolic analyses and Java toolchains are hardest, and that existing agents suffer from stage mixing, poor error localization, and premature termination.

Significance. If the manual verification is reliable and reproducible, the work is significant for providing the first systematic empirical study of LLM agents on the specific, multi-stage problem of automated software analysis setup (distinct from general environment setup or issue resolution). The cross-architecture and cross-LLM comparisons, the catalog of failure modes, and the finding that specialized agent design can outperform raw model scale are useful contributions that could guide future agent development in software engineering.

major comments (2)
  1. §5 (Results) and §4.2 (Evaluation Protocol): The central quantitative claims (94% vs. 77% success, architecture > LLM capability) rest entirely on manually verified outcomes against reference setups, yet no explicit rubric, decision criteria for 'meaningful analysis outputs' (especially for whole-program or symbolic analyses), inter-rater agreement statistics, or blinding protocol are provided. With only 35 tasks, modest verifier variance could materially alter the reported gap and the downstream taxonomy of failure modes.
  2. §5.3 (Task Difficulty Analysis): The claims that whole-program analyses and symbolic execution are the most difficult, and that Java poses greater challenges than C/C++, are presented without per-category success tables, statistical tests, or breakdown by agent/LLM, making it hard to assess whether these patterns are robust or driven by a few edge cases.
minor comments (2)
  1. The exact model identifier 'Gemini-3-Flash' should be clarified (e.g., Gemini 1.5 Flash or a later variant) with version and access date for reproducibility.
  2. Figure captions and table headers could more explicitly state that success rates are manually verified rather than automatically measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of transparency in our evaluation protocol and results presentation. We address each major comment below, indicating the revisions we will make to improve clarity and robustness without altering the core findings.

Point-by-point responses
  1. Referee: §5 (Results) and §4.2 (Evaluation Protocol): The central quantitative claims (94% vs. 77% success, architecture > LLM capability) rest entirely on manually verified outcomes against reference setups, yet no explicit rubric, decision criteria for 'meaningful analysis outputs' (especially for whole-program or symbolic analyses), inter-rater agreement statistics, or blinding protocol are provided. With only 35 tasks, modest verifier variance could materially alter the reported gap and the downstream taxonomy of failure modes.

    Authors: We agree that the manuscript would benefit from greater explicitness regarding the verification process. In the revised version, we will expand §4.2 to include a detailed rubric and specific decision criteria for determining 'meaningful analysis outputs,' with tailored guidance for whole-program and symbolic analyses (e.g., requiring non-empty, tool-specific reports such as call graphs or execution traces that match expected analysis semantics). The verification was performed by the authors against the independently constructed reference setups described in the benchmark. We will document this process, including that it was not formally blinded and that inter-rater agreement statistics are not available because verification was conducted by a single primary verifier for consistency (with spot-checks by co-authors). We will also add a discussion of potential verifier variance as a limitation given the small task count. These changes will strengthen the reproducibility of the 94% vs. 77% claims and the failure mode taxonomy. revision: yes

  2. Referee: §5.3 (Task Difficulty Analysis): The claims that whole-program analyses and symbolic execution are the most difficult, and that Java poses greater challenges than C/C++, are presented without per-category success tables, statistical tests, or breakdown by agent/LLM, making it hard to assess whether these patterns are robust or driven by a few edge cases.

    Authors: We concur that additional granularity in §5.3 would improve interpretability. In the revision, we will add per-category success tables showing breakdown by analysis type (whole-program/symbolic vs. intra-procedural), by language (Java vs. C/C++), and by agent architecture and LLM backend. We will also incorporate statistical tests (e.g., Fisher's exact test for proportions) where the sample sizes per category permit, or explicitly note the descriptive nature of the patterns when tests lack power due to the total of 35 tasks. This will allow readers to evaluate whether the difficulty claims are driven by systematic trends or isolated cases. revision: yes
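
For reference, the per-category comparison the rebuttal proposes can be run with SciPy's Fisher's exact test. The 2x2 counts below are invented for illustration; the excerpt does not report per-language success numbers or the C/C++ vs. Java task split.

    from scipy.stats import fisher_exact

    # Hypothetical contingency table: rows = language, columns = (successes, failures).
    # The 23/12 task split and the counts are assumptions, not figures from the paper.
    c_cpp = [20, 3]   # 20 successes, 3 failures on C/C++ tasks
    java = [7, 5]     # 7 successes, 5 failures on Java tasks

    odds_ratio, p_value = fisher_exact([c_cpp, java], alternative="two-sided")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
    # With only 35 tasks, such tests are often underpowered, consistent with the
    # rebuttal's plan to mark some patterns as descriptive.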

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark evaluation

Full rationale

The paper reports an empirical evaluation of LLM agents on 35 tool-project pairs using manually constructed reference setups and manual verification of whether agent outputs produce meaningful analysis results. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Success rates (e.g., 94% for AnalysisAgent) are computed directly from human judgment against the references rather than derived from any model or prior result by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support central claims. The methodology is self-contained as an experimental comparison and does not reduce any result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study. No free parameters, mathematical axioms, or invented entities are required; the claims rest on the construction of reference setups and the definition of manual success verification.

pith-pipeline@v0.9.0 · 5531 in / 1244 out tokens · 71090 ms · 2026-05-10T16:33:59.092944+00:00 · methodology

