Evaluating LLM Agents on Automated Software Analysis Tasks
Pith reviewed 2026-05-10 16:33 UTC · model grok-4.3
The pith
A custom LLM agent achieves 94% success in setting up software analysis tools on a benchmark of 35 tasks, far exceeding the best baseline's 77%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.
What carries the argument
AnalysisBench, a benchmark of manually constructed reference setups for 35 tool-project pairs, used to measure agent success in installing, configuring, and running analysis tools to produce valid outputs.
If this is right
- Targeted agent designs can reliably automate software analysis tool deployment without expert intervention.
- Limitations like stage mixing and premature termination can be mitigated through improved agent workflows.
- Whole-program and symbolic analyses remain harder, suggesting a need for specialized agent strategies.
- Java setups are tougher than C/C++, pointing to language-specific difficulties in agent performance.
- LLM self-validation overestimates true success, requiring external verification.
Where Pith is reading between the lines
- These results suggest that for configuration-heavy software engineering tasks, investing in agent structure may yield higher returns than scaling up model size alone.
- The benchmark could be extended to other languages or analysis domains to test if the architecture advantage generalizes.
- If adopted, such agents might make advanced program analysis tools more widely usable by non-experts in open source projects.
- Future work might explore combining these agents with code generation for fixing setup errors dynamically.
Load-bearing premise
The manually constructed reference setups represent the definitive correct configurations for each tool-project pair, and human manual verification of agent outputs is objective and unbiased.
What would settle it
Re-running the agent evaluations with different human verifiers or independent reference setups and finding substantially lower success rates or high disagreement among verifiers.
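One concrete way to quantify "high disagreement among verifiers" is Cohen's kappa between two independent pass/fail judgments over the same 35 tasks. The sketch below is illustrative only: the two verifier label lists are hypothetical, not data from the paper.

```python
# Sketch: measuring verifier disagreement with Cohen's kappa.
# The labels below are hypothetical, not the paper's actual judgments.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical re-verification of the 35 tasks: verifier 2 is stricter
# and flips two of verifier 1's "pass" judgments to "fail".
verifier_1 = ["pass"] * 33 + ["fail"] * 2
verifier_2 = ["pass"] * 31 + ["fail"] * 4
kappa = cohens_kappa(verifier_1, verifier_2)  # ≈ 0.64: moderate agreement
```

With only 35 items, even two flipped judgments pull kappa well below the near-perfect range, which is exactly why a reported 33/35 success rate is sensitive to verifier variance.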
Original abstract
Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, yet no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves manually verified success rates of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AnalysisBench, a benchmark of 35 manually constructed tool-project pairs spanning seven analysis tools and ten C/C++/Java projects. It evaluates four agent architectures (including a proposed AnalysisAgent) across four LLM backends on the task of installing, configuring, and executing software analysis tools to produce meaningful outputs. Key claims include AnalysisAgent reaching 94% manually verified success (33/35 tasks with Gemini-3-Flash) versus 77% for the strongest baseline, that agent architecture outweighs LLM choice, that whole-program/symbolic analyses and Java toolchains are hardest, and that existing agents suffer from stage mixing, poor error localization, and premature termination.
Significance. If the manual verification is reliable and reproducible, the work is significant for providing the first systematic empirical study of LLM agents on the specific, multi-stage problem of automated software analysis setup (distinct from general environment setup or issue resolution). The cross-architecture and cross-LLM comparisons, the catalog of failure modes, and the finding that specialized agent design can outperform raw model scale are useful contributions that could guide future agent development in software engineering.
major comments (2)
- [§5 and §4.2] §5 (Results) and §4.2 (Evaluation Protocol): The central quantitative claims (94% vs. 77% success, architecture > LLM capability) rest entirely on manually verified outcomes against reference setups, yet no explicit rubric, decision criteria for 'meaningful analysis outputs' (especially for whole-program or symbolic analyses), inter-rater agreement statistics, or blinding protocol are provided. With only 35 tasks, modest verifier variance could materially alter the reported gap and the downstream taxonomy of failure modes.
- [§5.3] §5.3 (Task Difficulty Analysis): The claims that whole-program analyses and symbolic execution are the most difficult, and that Java poses greater challenges than C/C++, are presented without per-category success tables, statistical tests, or breakdown by agent/LLM, making it hard to assess whether these patterns are robust or driven by a few edge cases.
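Because both agents are scored on the same 35 tasks, the 94% vs. 77% gap is a paired comparison, and an exact McNemar test on the discordant pairs is the natural robustness check. The sketch below is stdlib-only; the discordant-pair split (7 vs. 1) is a hypothetical breakdown consistent with 33/35 vs. 27/35, not counts reported in the paper.

```python
# Sketch: exact (binomial) McNemar test on paired per-task outcomes.
# Discordant-pair counts here are hypothetical, not from the paper.
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: tasks only agent A solved; c: tasks only agent B solved.
    Under the null, each discordant pair is a fair coin flip.
    """
    n = b + c
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Hypothetical split consistent with 33/35 vs 27/35:
# 7 tasks only AnalysisAgent solved, 1 task only ExecutionAgent solved.
p = mcnemar_exact(7, 1)  # ≈ 0.07: suggestive, but underpowered at 35 tasks
```

A p-value near 0.07 on this plausible split illustrates the referee's point: with 35 tasks, the headline gap may not survive a formal paired test, so the revision's per-category tables matter.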
minor comments (2)
- The exact model identifier 'Gemini-3-Flash' should be clarified (e.g., Gemini 1.5 Flash or a later variant) with version and access date for reproducibility.
- Figure captions and table headers could more explicitly state that success rates are manually verified rather than automatically measured.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of transparency in our evaluation protocol and results presentation. We address each major comment below, indicating the revisions we will make to improve clarity and robustness without altering the core findings.
Point-by-point responses
-
Referee: [§5 and §4.2] §5 (Results) and §4.2 (Evaluation Protocol): The central quantitative claims (94% vs. 77% success, architecture > LLM capability) rest entirely on manually verified outcomes against reference setups, yet no explicit rubric, decision criteria for 'meaningful analysis outputs' (especially for whole-program or symbolic analyses), inter-rater agreement statistics, or blinding protocol are provided. With only 35 tasks, modest verifier variance could materially alter the reported gap and the downstream taxonomy of failure modes.
Authors: We agree that the manuscript would benefit from greater explicitness regarding the verification process. In the revised version, we will expand §4.2 to include a detailed rubric and specific decision criteria for determining 'meaningful analysis outputs,' with tailored guidance for whole-program and symbolic analyses (e.g., requiring non-empty, tool-specific reports such as call graphs or execution traces that match expected analysis semantics). The verification was performed by the authors against the independently constructed reference setups described in the benchmark. We will document this process, including that it was not formally blinded and that inter-rater agreement statistics are not available because verification was conducted by a single primary verifier for consistency (with spot-checks by co-authors). We will also add a discussion of potential verifier variance as a limitation given the small task count. These changes will strengthen the reproducibility of the 94% vs. 77% claims and the failure mode taxonomy. revision: yes
-
Referee: [§5.3] §5.3 (Task Difficulty Analysis): The claims that whole-program analyses and symbolic execution are the most difficult, and that Java poses greater challenges than C/C++, are presented without per-category success tables, statistical tests, or breakdown by agent/LLM, making it hard to assess whether these patterns are robust or driven by a few edge cases.
Authors: We concur that additional granularity in §5.3 would improve interpretability. In the revision, we will add per-category success tables showing breakdown by analysis type (whole-program/symbolic vs. intra-procedural), by language (Java vs. C/C++), and by agent architecture and LLM backend. We will also incorporate statistical tests (e.g., Fisher's exact test for proportions) where the sample sizes per category permit, or explicitly note the descriptive nature of the patterns when tests lack power due to the total of 35 tasks. This will allow readers to evaluate whether the difficulty claims are driven by systematic trends or isolated cases. revision: yes
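The Fisher's exact test the authors propose for the revised §5.3 can be run with nothing beyond the standard library. The sketch below hedges in two ways: the per-language success counts are an invented split of the 35 tasks (the paper does not report them in the provided text), and the two-sided p-value is computed by the standard "sum all tables at least as extreme" convention.

```python
# Sketch: two-sided Fisher's exact test on a hypothetical per-language
# success table -- the counts are illustrative, not the paper's data.
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(k):  # P(k successes in row 1) under the hypergeometric null
        return comb(row1, k) * comb(n - row1, col1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum probabilities of all tables no more likely than the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs + 1e-12)

# Rows: Java vs C/C++; columns: (successes, failures).
# Hypothetical split of 35 tasks: Java 10/15 vs C/C++ 19/20 successful.
p = fisher_exact_two_sided(10, 5, 19, 1)  # ≈ 0.064
```

Even a fairly stark-looking hypothetical split (67% vs. 95% success) only reaches p ≈ 0.064 here, which supports the authors' caveat that some per-category comparisons will have to remain descriptive at this sample size.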
Circularity Check
No circularity: direct empirical benchmark evaluation
full rationale
The paper reports an empirical evaluation of LLM agents on 35 tool-project pairs using manually constructed reference setups and manual verification of whether agent outputs produce meaningful analysis results. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear in the provided text. Success rates (e.g., 94% for AnalysisAgent) are computed directly from human judgment against the references rather than derived from any model or prior result by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support central claims. The methodology is self-contained as an experimental comparison and does not reduce any result to its own inputs.