pith. machine review for the scientific record.

arXiv: 2604.11270 · v2 · submitted 2026-04-13 · 💻 cs.SE

Recognition: unknown

Evaluating LLM Agents on Automated Software Analysis Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM agents · software analysis tools · benchmark · agent architecture · automated configuration · C/C++ and Java projects · tool setup

The pith

A custom LLM agent achieves 94% success in setting up software analysis tools on a benchmark of 35 tasks, far exceeding baseline agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AnalysisBench, a benchmark of 35 tool-project pairs across seven analysis tools and ten C/C++ and Java projects, each with a reference setup. It tests four agent architectures on four LLMs and finds that a custom agent reaches a 94% manually verified success rate, compared to 77% for the strongest baseline. The evaluation reveals that agent architecture is more important than the specific LLM used, and identifies common problems in other agents such as mixing different stages, failing to localize errors, and stopping too soon. Readers would care because this shows a path to automating the difficult task of configuring analysis tools for real projects, which currently requires significant manual effort.

Core claim

Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.
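
A quick arithmetic check of these headline numbers, as a sketch rather than a result from the paper: the 27/35 task count for ExecutionAgent is inferred from the reported 77%, and the exact effect-size measure behind Figure 2's "times more likely to succeed" framing is not specified in the excerpt, so both a risk ratio and an odds ratio are shown.

    # Sanity-check the reported success rates and the relative-success framing of Figure 2.
    # The 27/35 baseline count is an assumption inferred from the reported 77%,
    # not a number stated in the paper.
    tasks = 35
    analysis_agent_successes = 33            # given: 33/35 with Gemini-3-Flash
    execution_agent_successes = 27           # assumed from 77% of 35

    p_analysis = analysis_agent_successes / tasks      # ~0.943
    p_execution = execution_agent_successes / tasks    # ~0.771

    risk_ratio = p_analysis / p_execution               # ratio of success rates
    odds = lambda p: p / (1 - p)
    odds_ratio = odds(p_analysis) / odds(p_execution)   # alternative effect size

    print(f"AnalysisAgent  {p_analysis:.1%}")    # 94.3%
    print(f"ExecutionAgent {p_execution:.1%}")   # 77.1%
    print(f"risk ratio  {risk_ratio:.2f}x")      # ~1.22x
    print(f"odds ratio  {odds_ratio:.2f}x")      # ~4.89x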

What carries the argument

AnalysisBench, a benchmark of manually constructed reference setups for 35 tool-project pairs, used to measure agent success in installing, configuring, and running analysis tools to produce valid outputs.
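
To make concrete what a reference setup and a success check involve, here is a hypothetical sketch of one benchmark entry in Python. The schema, field names, and the non-empty-output check are illustrative assumptions, not AnalysisBench's actual format; in particular, the paper's verdicts come from manual verification against the reference setup, not from an automated check like this one.

    from dataclasses import dataclass, field
    import subprocess

    @dataclass
    class AnalysisTask:
        # One tool-project pair; field names are hypothetical, not the paper's schema.
        tool: str                      # e.g. "KLEE" (symbolic execution)
        project: str                   # e.g. "fastfetch" (a C project)
        language: str                  # "C/C++" or "Java"
        reference_setup: list = field(default_factory=list)  # hand-built shell commands
        expected_output: str = "report.txt"                   # artifact the tool must produce

    def run_agent_setup(commands, timeout_s=3600):
        """Execute an agent-proposed setup; stop at the first failing step."""
        for cmd in commands:
            proc = subprocess.run(cmd, shell=True, capture_output=True, timeout=timeout_s)
            if proc.returncode != 0:
                return False
        return True

    def output_looks_meaningful(path):
        """Cheap stand-in for manual verification: a non-empty analysis report.
        The benchmark's actual verdicts come from human inspection against the reference."""
        try:
            with open(path) as f:
                return bool(f.read().strip())
        except OSError:
            return False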

If this is right

  • Targeted agent designs can reliably automate software analysis tool deployment without expert intervention.
  • Limitations like stage mixing and premature termination can be mitigated through improved agent workflows.
  • Whole-program and symbolic analyses remain harder, suggesting a need for specialized agent strategies.
  • Java setups are tougher than C/C++, pointing to language-specific difficulties in agent performance.
  • LLM self-validation overestimates true success, requiring external verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results suggest that for configuration-heavy software engineering tasks, investing in agent structure may yield higher returns than scaling up model size alone.
  • The benchmark could be extended to other languages or analysis domains to test if the architecture advantage generalizes.
  • If adopted, such agents might make advanced program analysis tools more widely usable by non-experts in open source projects.
  • Future work might explore combining these agents with code generation for fixing setup errors dynamically.

Load-bearing premise

The manually constructed reference setups represent the definitive correct configurations for each tool-project pair, and human manual verification of agent outputs is objective and free of bias.

What would settle it

Re-running the agent evaluations with different human verifiers or independent reference setups and finding substantially lower success rates or high disagreement among verifiers.
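
If two or more independent verifiers re-judged the 35 tasks, their agreement could be summarized with an inter-rater statistic such as Cohen's kappa. A minimal sketch follows; the two verdict vectors are made up for illustration and do not correspond to any real re-verification.

    def cohen_kappa(rater_a, rater_b):
        """Cohen's kappa for two raters giving binary pass/fail verdicts per task."""
        assert len(rater_a) == len(rater_b) and rater_a
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        pa, pb = sum(rater_a) / n, sum(rater_b) / n
        expected = pa * pb + (1 - pa) * (1 - pb)   # chance agreement for binary labels
        return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

    # Hypothetical verdicts over the 35 tasks (True = setup judged successful).
    verifier_1 = [True] * 33 + [False] * 2
    verifier_2 = [True] * 31 + [False] * 4
    print(f"kappa = {cohen_kappa(verifier_1, verifier_2):.2f}")   # ~0.64 for these made-up verdicts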

Figures

Figures reproduced from arXiv: 2604.11270 by Cristian Cadar, Islem Bouzenia, Michael Pradel.

Figure 1. Stage 4 (analysis run) for KLEE on fastfetch.
Figure 2. How many times more likely AnalysisAgent is to succeed compared to baselines, per LLM backend (circles) and pooled.
Figure 3. Average verified success rate by analysis tool and …
Figure 4. Root-cause failure categories by agent (%).
Figure 5. Resource consumption by agent and LLM backend.
Original abstract

Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AnalysisBench, a benchmark of 35 manually constructed tool-project pairs spanning seven analysis tools and ten C/C++/Java projects. It evaluates four agent architectures (including a proposed AnalysisAgent) across four LLM backends on the task of installing, configuring, and executing software analysis tools to produce meaningful outputs. Key claims include AnalysisAgent reaching 94% manually verified success (33/35 tasks with Gemini-3-Flash) versus 77% for the strongest baseline, that agent architecture outweighs LLM choice, that whole-program/symbolic analyses and Java toolchains are hardest, and that existing agents suffer from stage mixing, poor error localization, and premature termination.

Significance. If the manual verification is reliable and reproducible, the work is significant for providing the first systematic empirical study of LLM agents on the specific, multi-stage problem of automated software analysis setup (distinct from general environment setup or issue resolution). The cross-architecture and cross-LLM comparisons, the catalog of failure modes, and the finding that specialized agent design can outperform raw model scale are useful contributions that could guide future agent development in software engineering.

major comments (2)
  1. §5 (Results) and §4.2 (Evaluation Protocol): The central quantitative claims (94% vs. 77% success, architecture > LLM capability) rest entirely on manually verified outcomes against reference setups, yet no explicit rubric, decision criteria for 'meaningful analysis outputs' (especially for whole-program or symbolic analyses), inter-rater agreement statistics, or blinding protocol are provided. With only 35 tasks, modest verifier variance could materially alter the reported gap and the downstream taxonomy of failure modes.
  2. §5.3 (Task Difficulty Analysis): The claims that whole-program analyses and symbolic execution are the most difficult, and that Java poses greater challenges than C/C++, are presented without per-category success tables, statistical tests, or breakdown by agent/LLM, making it hard to assess whether these patterns are robust or driven by a few edge cases.
minor comments (2)
  1. The exact model identifier 'Gemini-3-Flash' should be clarified (e.g., Gemini 1.5 Flash or a later variant) with version and access date for reproducibility.
  2. Figure captions and table headers could more explicitly state that success rates are manually verified rather than automatically measured.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of transparency in our evaluation protocol and results presentation. We address each major comment below, indicating the revisions we will make to improve clarity and robustness without altering the core findings.

Point-by-point responses
  1. Referee: §5 (Results) and §4.2 (Evaluation Protocol): The central quantitative claims (94% vs. 77% success, architecture > LLM capability) rest entirely on manually verified outcomes against reference setups, yet no explicit rubric, decision criteria for 'meaningful analysis outputs' (especially for whole-program or symbolic analyses), inter-rater agreement statistics, or blinding protocol are provided. With only 35 tasks, modest verifier variance could materially alter the reported gap and the downstream taxonomy of failure modes.

    Authors: We agree that the manuscript would benefit from greater explicitness regarding the verification process. In the revised version, we will expand §4.2 to include a detailed rubric and specific decision criteria for determining 'meaningful analysis outputs,' with tailored guidance for whole-program and symbolic analyses (e.g., requiring non-empty, tool-specific reports such as call graphs or execution traces that match expected analysis semantics). The verification was performed by the authors against the independently constructed reference setups described in the benchmark. We will document this process, including that it was not formally blinded and that inter-rater agreement statistics are not available because verification was conducted by a single primary verifier for consistency (with spot-checks by co-authors). We will also add a discussion of potential verifier variance as a limitation given the small task count. These changes will strengthen the reproducibility of the 94% vs. 77% claims and the failure mode taxonomy. revision: yes

  2. Referee: §5.3 (Task Difficulty Analysis): The claims that whole-program analyses and symbolic execution are the most difficult, and that Java poses greater challenges than C/C++, are presented without per-category success tables, statistical tests, or breakdown by agent/LLM, making it hard to assess whether these patterns are robust or driven by a few edge cases.

    Authors: We concur that additional granularity in §5.3 would improve interpretability. In the revision, we will add per-category success tables showing breakdown by analysis type (whole-program/symbolic vs. intra-procedural), by language (Java vs. C/C++), and by agent architecture and LLM backend. We will also incorporate statistical tests (e.g., Fisher's exact test for proportions) where the sample sizes per category permit, or explicitly note the descriptive nature of the patterns when tests lack power due to the total of 35 tasks. This will allow readers to evaluate whether the difficulty claims are driven by systematic trends or isolated cases. revision: yes
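
For reference, the per-category comparison the rebuttal proposes can be run with SciPy's Fisher's exact test. The 2x2 counts below are invented for illustration; the excerpt does not report per-language success numbers or the C/C++ vs. Java task split.

    from scipy.stats import fisher_exact

    # Hypothetical contingency table: rows = language, columns = (successes, failures).
    # The 23/12 task split and the counts are assumptions, not figures from the paper.
    c_cpp = [20, 3]   # 20 successes, 3 failures on C/C++ tasks
    java = [7, 5]     # 7 successes, 5 failures on Java tasks

    odds_ratio, p_value = fisher_exact([c_cpp, java], alternative="two-sided")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
    # With only 35 tasks, such tests are often underpowered, consistent with the
    # rebuttal's plan to mark some patterns as descriptive.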

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark evaluation

Full rationale

The paper reports an empirical evaluation of LLM agents on 35 tool-project pairs using manually constructed reference setups and manual verification of whether agent outputs produce meaningful analysis results. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Success rates (e.g., 94% for AnalysisAgent) are computed directly from human judgment against the references rather than derived from any model or prior result by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support central claims. The methodology is self-contained as an experimental comparison and does not reduce any result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study. No free parameters, mathematical axioms, or invented entities are required; the claims rest on the construction of reference setups and the definition of manual success verification.

pith-pipeline@v0.9.0 · 5531 in / 1244 out tokens · 71090 ms · 2026-05-10T16:33:59.092944+00:00 · methodology

