pith. machine review for the scientific record.

arxiv: 2605.09304 · v1 · submitted 2026-05-10 · 💻 cs.SE · cs.HC

Generating Complex Code Analyzers from Natural Language Questions

Amirmohammad Nazari, Mukund Raghothaman, Robin Jia, Sadra Sabouri, Souti Chattopadhyay, Wang Bill Zhu

Pith reviewed 2026-05-12 04:58 UTC · model grok-4.3

classification 💻 cs.SE cs.HC
keywords CodeQL · natural language to query · program analysis · large language models · RAG · assistive queries · bug finding · user study

The pith

Merlin converts natural language questions into reliable CodeQL queries that let programmers analyze million-line codebases more accurately and quickly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Merlin to let developers pose free-form questions about large codebases and receive answers that require semantic or inter-procedural reasoning. It addresses the core problem of turning those questions into correct CodeQL queries by combining retrieval-augmented generation for iterative refinement with a self-test that uses assistive queries to produce concrete witnesses exposing semantic errors. Experiments show Merlin recovers most issues found by prior tools and uncovers additional ones, while a user study reports that access to the system increases task accuracy by an average factor of 3.8 and shortens overall completion time by 31 percent.

Core claim

Merlin integrates an LLM with CodeQL through a RAG-based iterative query-generation approach and a novel self-test technique that builds assistive queries to generate witnesses exposing flaws in candidate queries, thereby producing non-degenerate and semantically correct analyzers from natural language questions about large codebases.

What carries the argument

RAG-based iterative query-generation combined with the assistive-query self-test technique that produces concrete witnesses to debug semantic flaws in candidate CodeQL queries.
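
To fix intuition, here is a minimal sketch of how such a generate-debug-test loop could be wired together. It is a sketch under stated assumptions, not Merlin's implementation: every function name below is a hypothetical stand-in for machinery the paper describes only at the architecture level (RAG retrieval, LLM drafting, CodeQL compilation and execution), stubbed so the loop runs end to end.

```python
# Hypothetical Merlin-style pipeline; no names here come from the paper.
# Each stub stands in for a real component of the described architecture.

def retrieve_docs(topic: str) -> str:
    return f"[CodeQL docs relevant to: {topic}]"            # stub: RAG retrieval

def llm_generate_query(prompt: str) -> str:
    # Stub: a real system would prompt an LLM to draft a CodeQL query.
    return "from Call c where ... select c"

def compile_codeql(query: str) -> tuple[bool, str]:
    return True, ""                                         # stub: (ok, syntax errors)

def run_codeql(query: str, program: str) -> list[str]:
    return ["Widget.<init> -> doLogic()"]                   # stub: result rows

def assistive_witnesses(query: str, tests: list[str]) -> list[str]:
    # Stub: weaker "assistive" queries would produce concrete witnesses
    # explaining why the candidate query returned nothing.
    return ["witness: constructor call reaching an overridable method"]

def answer_question(question: str, codebase: str, max_iters: int = 5) -> list[str]:
    docs = retrieve_docs(question)
    tests = [f"[small test program exercising: {question}]"]  # LLM-generated self-tests
    query = llm_generate_query(question + docs)
    for _ in range(max_iters):
        ok, errors = compile_codeql(query)
        if not ok:
            # Syntax error: retrieve docs about the failing construct and
            # re-prompt (the RAG-based debugging step).
            query = llm_generate_query(query + retrieve_docs(errors))
            continue
        if all(run_codeql(query, t) for t in tests):
            break                                           # non-degenerate on every self-test
        # Degenerate/empty output: feed witnesses back into the next round.
        query = llm_generate_query(query + str(assistive_witnesses(query, tests)))
    return run_codeql(query, codebase)                      # final query over the full codebase

print(answer_question("constructors calling overridable methods", "[million-line repo]"))
```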

If this is right

  • Merlin recovers the majority of software issues reported by other approaches and additionally finds issues that would otherwise remain undetected.
  • Programmers using Merlin complete analysis tasks with 3.8 times higher accuracy on average.
  • Access to Merlin reduces the total time programmers spend on the same set of tasks by 31 percent.
  • The system can answer questions that demand semantic or inter-procedural reasoning beyond the reach of simple text search tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The assistive-query debugging pattern could be applied to refine LLM outputs in other program-analysis or query languages.
  • Embedding Merlin-style assistance inside development environments might let non-experts perform advanced static analysis without learning query syntax.
  • The same iterative refinement loop may shorten the time needed to adapt CodeQL queries to new codebases or new classes of bugs.

Load-bearing premise

The RAG-based iterative query-generation approach together with the assistive-query self-test will reliably yield non-degenerate and semantically correct CodeQL queries for diverse natural language questions on large codebases.
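
One plausible reading of that premise, reduced to a runnable toy: treat a candidate query as a conjunction of constraints, and when it returns nothing on a generated test case, re-run it with one constraint dropped at a time; whichever relaxation suddenly produces results yields the concrete witnesses that localize the fault. The fact schema and constraint names below are invented for illustration; the real system works over CodeQL on actual programs, and the paper does not commit to this exact relaxation strategy.

```python
# Toy call-graph facts standing in for what a self-test program would yield.
calls = [
    {"caller": "Widget.<init>", "callee": "doLogic", "callee_overridable": True},
    {"caller": "Widget.render", "callee": "doLogic", "callee_overridable": True},
]

# Candidate "query": constructor calls an overridable method. The second
# conjunct checks a nonexistent key, so the full conjunction is empty.
candidate = [
    ("caller is a constructor", lambda c: "<init>" in c["caller"]),
    ("callee is overridable",   lambda c: c.get("overridable", False)),  # planted bug
]

def run(conjuncts, facts):
    """Evaluate a conjunction of predicates over the fact set."""
    return [f for f in facts if all(pred(f) for _, pred in conjuncts)]

if not run(candidate, calls):  # degenerate result: fire the assistive queries
    for i, (name, _) in enumerate(candidate):
        relaxed = candidate[:i] + candidate[i + 1:]  # drop one conjunct
        witnesses = run(relaxed, calls)
        if witnesses:
            # The restored rows are the witnesses; the dropped conjunct is the suspect.
            print(f"dropping '{name}' yields witnesses: {witnesses}")
```

Here the witnesses point at the "callee is overridable" conjunct as the over-constrained one; in the described system, such witnesses would be serialized into the repair prompt for the next generation round.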

What would settle it

A test set of natural language questions drawn from real bug-finding tasks where Merlin produces mostly empty or incorrect CodeQL queries that fail to detect known issues in a million-line codebase.

Figures

Figures reproduced from arXiv: 2605.09304 by Amirmohammad Nazari, Mukund Raghothaman, Robin Jia, Sadra Sabouri, Souti Chattopadhyay, Wang Bill Zhu.

Figure 1. The Merlin user interaction model (1a). Example Java program in which a constructor calls an overridable method (1b). The concern is that the subclass.doLogic() method might access the subclass.color variable before it was initialized. Automatically generated CodeQL query to find violations of the MET05-J coding guideline (1c). Notice the need to access various kinds of information about the program, such …
Figure 2. The Merlin user interface. While working with a codebase, the user specifies a high-level question along with the desired output table schema. In response, Merlin returns the resulting output table together with the corresponding CodeQL query.
Figure 3. Overall architecture of Merlin. Merlin first uses an LLM to retrieve relevant documentation and generate test cases that reflect the user's goal. It then repeatedly uses the LLM to generate a candidate CodeQL query, addressing syntax errors with RAG-based debugging and resolving semantic errors by issuing assistive queries. The final query is executed on the entire codebase in order to produce the final tab…
Figure 4. Our running example in Section 3. (4a): The test case generated by the LLMs with an example use of …
Figure 5. Overlap between the issues found by Merlin (5d), its ablated variants (5a–5c), and the baselines (5e–5h), as compared to the reference solutions provided by SpotBugs and Security Lab, respectively. Each point indicates the number of locations identified by one benchmark/detector across the entire code repository. Figure 5d shows that Merlin reports a number of previously unreported locations. While it is t…
Figure 6. Proportion of new warnings (i.e., unreported by the reference SpotBugs and GitHub Security Lab analyzers) that are …
Figure 7. Time needed by participants in the user study.
Figure 8. Accuracy of responses in the usefulness user study.
Original abstract

Many software development tasks, such as implementing features and fixing bugs, begin with developers posing questions about a codebase. However, answering questions about codebases that span millions of lines of code across thousands of files is non-trivial. Standard tools like grep cannot answer questions requiring semantic or inter-procedural reasoning, and large language models (LLMs) struggle with large codebases due to resource and context constraints. In this paper, we present Merlin, a new system for answering free-form questions that require analytical reasoning about code. Merlin integrates an LLM with CodeQL, a program analysis framework that supports expressive queries over large codebases. We face two principal challenges in the design of such systems: First, program analysis queries are diverse and semantically complex; as a result, even syntactically well-formed queries frequently produce degenerate/empty results. Furthermore, relatively few CodeQL queries are available online, limiting the out-of-the-box effectiveness of LLMs as CodeQL query generators. We address these challenges by developing a RAG-based iterative query-generation approach and a novel self-test technique. Our query debugging technique builds on the idea of assistive queries, which generate concrete witnesses that expose and explain semantic flaws in candidate queries. We evaluate Merlin through both experimental and user studies. Over a set of natural language questions derived from common bug-finding tasks, Merlin discovered not only the majority of software issues reported by other approaches, but also issues that would have otherwise remained undetected. Through a within-subject user study, we found that access to Merlin increased task accuracy by an average of 3.8× and simultaneously reduced the time for programmers to complete all tasks by 31%.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Merlin, a system that combines LLMs with CodeQL to answer free-form natural language questions about large codebases requiring semantic or inter-procedural reasoning. It tackles challenges of query diversity and limited training examples via a RAG-based iterative query-generation approach and a self-test technique using assistive queries that produce concrete witnesses to expose semantic flaws. Experimental evaluation on bug-finding tasks shows Merlin recovers most issues found by prior approaches while detecting additional ones; a within-subject user study reports that Merlin access yields 3.8x higher task accuracy and 31% lower completion time.

Significance. If the user-study results hold after addressing potential confounds, the work would meaningfully advance automated code analysis by providing a practical natural-language interface to expressive static-analysis frameworks. The RAG-plus-self-test pipeline offers a concrete, reproducible method for generating reliable CodeQL queries from informal questions, which could reduce the barrier to using program analysis in everyday development. The empirical demonstration that the system both matches and exceeds existing bug finders on real tasks is a strength.

major comments (1)
  1. User-study section: The central claim of 3.8x accuracy gain and 31% time reduction rests on a within-subject design, yet the manuscript provides no description of counterbalancing task order, randomizing condition sequence, or analyzing per-participant ordering effects. Without these controls, learning or familiarity with the codebases on the second exposure could account for a substantial fraction of the reported deltas, undermining causal attribution to Merlin.
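
For concreteness, a sketch of one standard control the comment asks about: alternating condition order and rotating task order across participants, so that first-exposure learning effects are balanced across conditions. This is illustrative only, assuming a two-condition, four-task design; nothing here is drawn from the paper's actual protocol.

```python
# Illustrative counterbalancing for a two-condition within-subject study.
# Condition order alternates and task order rotates per participant, so any
# learning effect from the first block is spread over both conditions.

conditions = ["Merlin", "baseline"]
tasks = ["T1", "T2", "T3", "T4"]

def counterbalanced_plan(n_participants: int):
    for p in range(n_participants):
        order = conditions if p % 2 == 0 else conditions[::-1]
        shift = p % len(tasks)
        rotated = tasks[shift:] + tasks[:shift]
        half = len(rotated) // 2
        # Each participant does half the tasks under each condition.
        yield p, list(zip(order, [rotated[:half], rotated[half:]]))

for participant, plan in counterbalanced_plan(4):
    print(participant, plan)
```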

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the concern regarding the user study design below and will incorporate the necessary clarifications in the revised manuscript.

Point-by-point responses
  1. Referee: User-study section: The central claim of 3.8x accuracy gain and 31% time reduction rests on a within-subject design, yet the manuscript provides no description of counterbalancing task order, randomizing condition sequence, or analyzing per-participant ordering effects. Without these controls, learning or familiarity with the codebases on the second exposure could account for a substantial fraction of the reported deltas, undermining causal attribution to Merlin.

    Authors: We agree that the current manuscript lacks sufficient detail on these controls, which is necessary to fully support causal claims about Merlin's impact. In the revised version, we will expand the User Study section to describe the counterbalancing of task order across participants, the randomization of condition sequences (Merlin vs. baseline), and any post-experiment analysis of per-participant ordering effects. These additions will allow readers to evaluate the robustness of the 3.8x accuracy and 31% time improvements. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity in empirical system evaluation

Full rationale

The paper describes the Merlin system architecture (RAG-based iterative query generation plus assistive-query self-test) and supports its claims solely through independent experimental benchmarks and a within-subject user study reporting measured accuracy and time deltas. No equations, fitted parameters, self-definitional constructs, or derivation steps appear in the abstract or described content. The evaluation results are presented as direct measurements on external tasks and participants rather than reductions of outputs to inputs by construction. Self-citations, if present in the full text, are not load-bearing for the central empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper describing a tool and its evaluation. No free parameters, mathematical axioms, or invented entities are invoked in the abstract.

pith-pipeline@v0.9.0 · 5615 in / 1206 out tokens · 53871 ms · 2026-05-12T04:58:23.776512+00:00 · methodology


