Less Is More: Measuring How LLM Involvement Affects Chatbot Accuracy in Static Analysis
Pith reviewed 2026-05-09 21:08 UTC · model grok-4.3
The pith
A schema-constrained JSON intermediate representation produces more accurate static analysis queries from natural language than either direct LLM generation or agentic tool use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The structured intermediate representation achieves the highest result match rates, outperforming direct generation by 15–25 percentage points on large models and surpassing the agentic approach despite the latter consuming 8× more tokens. The benefit of structured intermediates is most pronounced for large models; for small models, schema compliance becomes the bottleneck.
What carries the argument
The spectrum of LLM involvement, measured by comparing direct query generation, generation of a schema-constrained JSON intermediate representation, and tool-augmented agentic generation; the JSON step constrains what the model may output while handing final query construction to deterministic code.
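To make that mechanism concrete, here is a minimal sketch of the schema-constrained pattern. The IR schema, its field names, and the translation rules are invented for illustration and are not the paper's actual intermediate representation; the emitted query follows the publicly documented Joern traversal style (e.g. cpg.call.name(...).l).

```python
# Illustrative sketch only: a tiny schema-constrained JSON IR plus a
# deterministic translator to a CPGQL query string. The schema and field
# names ("entity", "name_pattern", "action") are assumptions for this
# example, not the paper's actual intermediate representation.
import json
import jsonschema  # third-party: pip install jsonschema

IR_SCHEMA = {
    "type": "object",
    "properties": {
        "entity": {"enum": ["method", "call", "parameter"]},
        "name_pattern": {"type": "string"},
        "action": {"enum": ["list", "count"]},
    },
    "required": ["entity", "action"],
    "additionalProperties": False,
}

def ir_to_cpgql(ir: dict) -> str:
    """Validate the LLM-produced IR, then build the query deterministically."""
    jsonschema.validate(ir, IR_SCHEMA)  # reject non-compliant model output
    query = f'cpg.{ir["entity"]}'
    if "name_pattern" in ir:
        query += f'.name("{ir["name_pattern"]}")'
    return query + (".size" if ir["action"] == "count" else ".l")

# The LLM emits only the JSON document; the CPGQL string is never model-written.
llm_output = '{"entity": "call", "name_pattern": "strcpy", "action": "list"}'
print(ir_to_cpgql(json.loads(llm_output)))  # cpg.call.name("strcpy").l
```

The point of the pattern is that a schema violation is a hard failure caught before execution, and everything downstream of validation is deterministic code rather than free-form generation.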
Load-bearing premise
The benchmark of 20 code analysis tasks across three complexity tiers is representative of real-world static analysis needs, and result match rates accurately measure practical usefulness.
What would settle it
Running the same three architectures on a fresh set of several hundred real user queries collected from actual Joern users and comparing the fraction of queries that return the expected analysis result.
Figures
Original abstract
Large language models are increasingly used to make static analysis tools accessible through natural language, yet existing systems differ in how much they delegate to the LLM without treating the degree of delegation as an independent variable. We compare three architectures along a spectrum of LLM involvement for translating natural language to Joern's query language CPGQL: direct query generation (Approach 1), generation of a schema-constrained JSON intermediate representation (Approach 2), and tool-augmented agentic generation (Approach 3). These are evaluated on a benchmark of 20 code analysis tasks across three complexity tiers, using four open-weight models in a 2×2 design (two model families × two scales), each with three repetitions. The structured intermediate representation (Approach 2) achieves the highest result match rates, outperforming direct generation by 15–25 percentage points on large models and surpassing the agentic approach despite the latter consuming 8× more tokens. The benefit of structured intermediates is most pronounced for large models; for small models, schema compliance becomes the bottleneck. These findings suggest that in formally structured domains, constraining the LLM's output to a well-typed intermediate representation and delegating query construction to deterministic code yields better results than either unconstrained generation or iterative tool use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares three architectures for using LLMs to translate natural language into CPGQL queries for static analysis: direct generation (approach 1), generation of a schema-constrained JSON intermediate representation followed by deterministic translation (approach 2), and tool-augmented agentic generation (approach 3). In a 2×2 design across four open-weight models (two families × two scales) and a benchmark of 20 tasks in three complexity tiers, each run three times, the structured-IR approach is reported to achieve the highest result match rates, outperforming direct generation by 15–25 percentage points on large models and the agentic approach despite the latter using 8× more tokens; benefits are smaller for small models where schema compliance is the bottleneck.
Significance. If the comparative ordering holds under fuller methodological disclosure, the work supplies controlled evidence that, in formally structured output domains, constraining LLM generation to a well-typed intermediate representation and delegating final construction to deterministic code can be both more accurate and more token-efficient than either unconstrained direct prompting or iterative tool-augmented agent workflows. The scale-dependent interaction and explicit token accounting are useful for practitioners designing LLM-augmented static-analysis interfaces.
major comments (3)
- [Evaluation] Evaluation section: The manuscript provides no description of how the 20 code-analysis tasks were constructed, how the three complexity tiers were defined or populated, or any validation that the benchmark is representative of real-world static-analysis needs; without these details the reported 15–25 pp advantages cannot be assessed for external validity.
- [Results] Results section: Although three repetitions per condition are mentioned, the paper reports only point estimates of result match rates and does not supply per-condition standard deviations, error bars, or any statistical test (e.g., McNemar or paired t-test) for the claimed differences; this leaves the reliability of the central ordering unclear.
- [Methods / Metrics] Metric definition: The exact operational definition of “result match rates” (syntactic query match, execution-result equivalence, or semantic equivalence to a ground-truth query) is never stated, nor is the scoring rule applied when schema compliance fails for small models; both omissions are load-bearing for interpreting the accuracy claims.
minor comments (1)
- [Introduction] The notation “CPGQL” and the macro “\cpgql{}” should be expanded on first use; readers outside the Joern community will otherwise be unable to follow the task description.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for improving methodological transparency and rigor. We address each major comment below, indicating the revisions we plan to incorporate.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: The manuscript provides no description of how the 20 code-analysis tasks were constructed, how the three complexity tiers were defined or populated, or any validation that the benchmark is representative of real-world static-analysis needs; without these details the reported 15–25 pp advantages cannot be assessed for external validity.
Authors: We agree that the current manuscript lacks sufficient detail on benchmark construction, limiting assessment of external validity. In the revised version, we will add a new subsection in Evaluation that describes: (1) task construction from representative static analysis scenarios in security and code quality domains (e.g., vulnerability detection and data-flow queries drawn from common Joern use cases); (2) complexity tier definitions based on explicit criteria such as query nesting depth, number of CPG elements involved, and schema coverage; and (3) our validation process via expert review. We will also add a limitations paragraph discussing the benchmark's scope and why a comprehensive representativeness study falls outside this paper's focus. revision: yes
- Referee: [Results] Results section: Although three repetitions per condition are mentioned, the paper reports only point estimates of result match rates and does not supply per-condition standard deviations, error bars, or any statistical test (e.g., McNemar or paired t-test) for the claimed differences; this leaves the reliability of the central ordering unclear.
Authors: We concur that variability measures and statistical support would strengthen the results. The revised Results section will report per-condition standard deviations across the three repetitions, include error bars on all figures, and add McNemar's tests for paired approach comparisons (or note limitations due to the small repetition count and binary outcomes); a sketch of such a test appears after these responses. The consistent ordering across models and repetitions already provides supporting evidence, but these additions will make the reliability clearer. revision: yes
- Referee: [Methods / Metrics] Metric definition: The exact operational definition of “result match rates” (syntactic query match, execution-result equivalence, or semantic equivalence to a ground-truth query) is never stated, nor is the scoring rule applied when schema compliance fails for small models; both omissions are load-bearing for interpreting the accuracy claims.
Authors: We will explicitly define the metric in the revised Methods section. Result match rate is operationalized as execution-result equivalence: after deterministic translation (where applicable), the query is run on the target codebase and matches if it returns an identical set of code elements to the ground-truth query. Schema compliance failures (primarily for small models) are scored as zero matches since the query cannot execute. We will also specify handling of execution errors or partial results; a minimal sketch of this scoring rule appears below. revision: yes
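To ground the definition in the last response, a minimal sketch of that scoring rule under the stated assumptions; `run_query` stands in for a hypothetical wrapper around a Joern query server and is not a real API.

```python
# Sketch of the scoring rule described above, not the paper's actual harness:
# a run matches only if the generated query executes and returns the same set
# of code elements as the ground-truth query; schema-compliance or execution
# failures score zero. `run_query` is a hypothetical callable, not a real API.
from typing import Callable, Iterable, Optional

def score_run(generated_query: Optional[str],
              ground_truth_query: str,
              run_query: Callable[[str], Iterable[str]]) -> int:
    if generated_query is None:      # schema compliance / generation failure
        return 0
    try:
        got = set(run_query(generated_query))
        expected = set(run_query(ground_truth_query))
    except Exception:                # any execution error counts as no match
        return 0
    return int(got == expected)

def match_rate(scores: list[int]) -> float:
    """Result match rate over a set of task x repetition runs."""
    return sum(scores) / len(scores)
```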
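And a sketch of the paired comparison mentioned in the second response, using the McNemar implementation from statsmodels on a 2×2 table of per-run binary outcomes; the counts here are placeholders, not results from the paper.

```python
# McNemar's test on paired match/no-match outcomes for two approaches run on
# the same tasks. The 2x2 table counts are placeholder values for illustration.
from statsmodels.stats.contingency_tables import mcnemar

# rows: Approach 2 (match, no match); cols: Approach 1 (match, no match)
table = [[30, 18],   # both match | only Approach 2 matches
         [4,   8]]   # only Approach 1 matches | neither matches
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```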
Circularity Check
No significant circularity
Full rationale
The paper reports a controlled empirical evaluation of three LLM architectures for natural-language-to-CPGQL translation on a fixed benchmark of 20 tasks. All reported outcomes are measured result-match rates obtained from direct execution against ground-truth queries; no equations, fitted parameters, or predictions are derived from the inputs by construction. The central claim (structured IR outperforming direct and agentic baselines) rests on observed percentages across model scales and repetitions, not on any self-referential definition or load-bearing self-citation. The work contains no derivation chain that reduces to its own inputs, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 20 code analysis tasks across three complexity tiers form a representative sample for evaluating query-generation accuracy.