pith. machine review for the scientific record.

arxiv: 2604.21746 · v1 · submitted 2026-04-23 · 💻 cs.SE

Recognition: unknown

Less Is More: Measuring How LLM Involvement Affects Chatbot Accuracy in Static Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:08 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM · static analysis · query generation · intermediate representation · agentic · accuracy · Joern · structured output

The pith

A schema-constrained JSON intermediate representation produces more accurate static analysis queries from natural language than either direct LLM generation or agentic tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three levels of LLM involvement when translating natural language requests into queries for a static analysis tool. Direct generation asks the model to write the full query. The middle method asks the model only to produce a structured JSON description that follows a fixed schema, after which deterministic code builds the query. The third method lets an agent call analysis tools repeatedly. Across 20 tasks and four open-weight models at two scales, the JSON intermediate approach matched the expected results most often, gaining 15 to 25 points over direct generation on large models and beating the agent method while using one-eighth the tokens.
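The division of labor in the middle method can be sketched in a few lines: the model's only job is to emit a small JSON object, a fixed schema rejects anything else, and deterministic code assembles the final query. A minimal sketch; the field names and the CPGQL template are illustrative guesses, not the paper's actual schema.

```python
# Illustrative schema for the A2 intermediate representation (NOT the paper's).
REQUIRED_FIELDS = {"node_type": str, "filter_property": str, "filter_value": str}

def validate_intermediate(obj: dict) -> bool:
    """Reject any model output that does not match the fixed schema exactly."""
    return (set(obj) == set(REQUIRED_FIELDS)
            and all(isinstance(obj[k], t) for k, t in REQUIRED_FIELDS.items()))

def build_query(obj: dict) -> str:
    """Deterministic translation: no LLM involvement past this point."""
    if not validate_intermediate(obj):
        raise ValueError("intermediate representation violates schema")
    return (f'cpg.{obj["node_type"]}'
            f'.where(_.{obj["filter_property"]}("{obj["filter_value"]}"))'
            f'.toList')

# Hypothetical intermediate for "find all calls to strcpy"
ir = {"node_type": "call", "filter_property": "name", "filter_value": "strcpy"}
print(build_query(ir))  # → cpg.call.where(_.name("strcpy")).toList
```

The design point the paper measures is exactly this split: the model can only fail by violating the schema, and every well-formed intermediate yields a syntactically valid query.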

Core claim

The structured intermediate representation achieves the highest result match rates, outperforming direct generation by 15–25 percentage points on large models and surpassing the agentic approach despite the latter consuming 8× more tokens. The benefit of structured intermediates is most pronounced for large models; for small models, schema compliance becomes the bottleneck.

What carries the argument

The spectrum of LLM involvement measured by comparing direct query generation, generation of a schema-constrained JSON intermediate representation, and tool-augmented agentic generation, with the JSON step serving as the mechanism that limits model output while handing final query construction to deterministic code.

Load-bearing premise

The benchmark of 20 code analysis tasks across three complexity tiers is representative of real-world static analysis needs, and result match rates accurately measure practical usefulness.

What would settle it

Running the same three architectures on a fresh set of several hundred real user queries collected from actual Joern users and comparing the fraction of queries that return the expected analysis result.

Figures

Figures reproduced from arXiv: 2604.21746 by Krishna Narasimhan.

Figure 1
Figure 1. The three architectures. Grey boxes are LLM-mediated; white boxes are deterministic. The panel lists Algorithms 1–3 (A1: direct generation, A2: structured intermediate, A3: agentic). Algorithm 1 (A1) reads: given NL task t, seed msgs with the system reference and the user task; for i ← 1 to 3, extract a query q from LLM(msgs), run it through Joern, return the result on success, otherwise append q and the error to msgs; fail after three attempts. view at source ↗
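The A1 retry loop from Figure 1 is compact enough to paraphrase as runnable code. A sketch under stated assumptions: `llm`, `extract_query`, and `run_joern` are injected stand-ins for the paper's actual components, not their real interfaces.

```python
def direct_generation(task, llm, extract_query, run_joern, max_attempts=3):
    # Conversation starts with the CPGQL reference as system prompt plus the task.
    msgs = [("system", "CPGQL reference"), ("user", task)]
    for _ in range(max_attempts):
        query = extract_query(llm(msgs))      # model writes the full query
        ok, result, error = run_joern(query)  # execute against Joern
        if ok:
            return result
        msgs += [("assistant", query), ("user", error)]  # feed error back for self-repair
    return None  # fail after three attempts, as in Algorithm 1

# Fake components for illustration: the first attempt fails, the second succeeds.
replies = iter(['bad query', 'cpg.call.name("strcpy")'])
fake_llm = lambda msgs: next(replies)
fake_joern = lambda q: (True, ["strcpy@main:12"], "") if q.startswith("cpg.") else (False, None, "parse error")
print(direct_generation("find strcpy calls", fake_llm, lambda s: s, fake_joern))  # → ['strcpy@main:12']
```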
Figure 2
Figure 2. The three approach procedures, side by side. Grey boxes are LLM-mediated. view at source ↗
Figure 3
Figure 3. Result match rate by approach and model. A2 (structured intermediate) leads across all models. A3 for Llama 70B is omitted (infrastructure failures). view at source ↗
Figure 5
Figure 5. Distribution of total token consumption per task (Qwen 72B). A1 and A2 cluster tightly below 2,000 tokens. A3 spans a 14× range (3,081–42,790) with a median of 6,756, consuming 4× more tokens at the median and 8× more on average, yet achieving the lowest accuracy. The small models complete the agentic loop more quickly, averaging 3.1 steps (Qwen 7B) and 2.8 steps (Llama 8B) compared to 4.8 for Qwen 72B, bu… view at source ↗
read the original abstract

Large language models are increasingly used to make static analysis tools accessible through natural language, yet existing systems differ in how much they delegate to the LLM without treating the degree of delegation as an independent variable. We compare three architectures along a spectrum of LLM involvement for translating natural language to Joern's query language CPGQL: direct query generation (A1), generation of a schema-constrained JSON intermediate representation (A2), and tool-augmented agentic generation (A3). These are evaluated on a benchmark of 20 code analysis tasks across three complexity tiers, using four open-weight models in a 2×2 design (two model families × two scales), each with three repetitions. The structured intermediate representation (A2) achieves the highest result match rates, outperforming direct generation by 15–25 percentage points on large models and surpassing the agentic approach despite the latter consuming 8× more tokens. The benefit of structured intermediates is most pronounced for large models; for small models, schema compliance becomes the bottleneck. These findings suggest that in formally structured domains, constraining the LLM's output to a well-typed intermediate representation and delegating query construction to deterministic code yields better results than either unconstrained generation or iterative tool use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript compares three architectures for using LLMs to translate natural language into CPGQL queries for static analysis: direct generation (approach 1), generation of a schema-constrained JSON intermediate representation followed by deterministic translation (approach 2), and tool-augmented agentic generation (approach 3). In a 2×2 design across four open-weight models (two families × two scales) and a benchmark of 20 tasks in three complexity tiers, each run three times, the structured-IR approach is reported to achieve the highest result match rates, outperforming direct generation by 15–25 percentage points on large models and the agentic approach despite the latter using 8× more tokens; benefits are smaller for small models where schema compliance is the bottleneck.

Significance. If the comparative ordering holds under fuller methodological disclosure, the work supplies controlled evidence that, in formally structured output domains, constraining LLM generation to a well-typed intermediate representation and delegating final construction to deterministic code can be both more accurate and more token-efficient than either unconstrained direct prompting or iterative tool-augmented agent workflows. The scale-dependent interaction and explicit token accounting are useful for practitioners designing LLM-augmented static-analysis interfaces.

major comments (3)
  1. [Evaluation] Evaluation section: The manuscript provides no description of how the 20 code-analysis tasks were constructed, how the three complexity tiers were defined or populated, or any validation that the benchmark is representative of real-world static-analysis needs; without these details the reported 15–25 pp advantages cannot be assessed for external validity.
  2. [Results] Results section: Although three repetitions per condition are mentioned, the paper reports only point estimates of result match rates and does not supply per-condition standard deviations, error bars, or any statistical test (e.g., McNemar or paired t-test) for the claimed differences; this leaves the reliability of the central ordering unclear.
  3. [Methods / Metrics] Metric definition: The exact operational definition of “result match rates” (syntactic query match, execution-result equivalence, or semantic equivalence to a ground-truth query) is never stated, nor is the scoring rule applied when schema compliance fails for small models; both omissions are load-bearing for interpreting the accuracy claims.
minor comments (1)
  1. [Introduction] The notation “CPGQL” and the macro “cpgql{}” should be expanded on first use; readers outside the Joern community will otherwise be unable to follow the task description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving methodological transparency and rigor. We address each major comment below, indicating the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The manuscript provides no description of how the 20 code-analysis tasks were constructed, how the three complexity tiers were defined or populated, or any validation that the benchmark is representative of real-world static-analysis needs; without these details the reported 15–25 pp advantages cannot be assessed for external validity.

    Authors: We agree that the current manuscript lacks sufficient detail on benchmark construction, limiting assessment of external validity. In the revised version, we will add a new subsection in Evaluation that describes: (1) task construction from representative static analysis scenarios in security and code quality domains (e.g., vulnerability detection and data-flow queries drawn from common Joern use cases); (2) complexity tier definitions based on explicit criteria such as query nesting depth, number of CPG elements involved, and schema coverage; and (3) our validation process via expert review. We will also add a limitations paragraph discussing the benchmark's scope and why a comprehensive representativeness study falls outside this paper's focus. revision: yes

  2. Referee: [Results] Results section: Although three repetitions per condition are mentioned, the paper reports only point estimates of result match rates and does not supply per-condition standard deviations, error bars, or any statistical test (e.g., McNemar or paired t-test) for the claimed differences; this leaves the reliability of the central ordering unclear.

    Authors: We concur that variability measures and statistical support would strengthen the results. The revised Results section will report per-condition standard deviations across the three repetitions, include error bars on all figures, and add McNemar's tests for paired approach comparisons (or note limitations due to small repetition count and binary outcomes). The consistent ordering across models and repetitions already provides supporting evidence, but these additions will make the reliability clearer. revision: yes
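For paired binary outcomes on a shared task set, McNemar's test depends only on the discordant pairs: tasks one approach solved and the other did not. An exact two-sided version needs only the standard library; the counts below are illustrative, not taken from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts:
    b = tasks only approach X solved, c = tasks only approach Y solved."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    # Binomial tail under H0 (each discordant pair equally likely to favor X or Y).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Illustrative counts (NOT from the paper): of 20 tasks, suppose A2 alone
# solved 7 and A1 alone solved 1, with the rest concordant.
print(round(mcnemar_exact(7, 1), 4))  # → 0.0703
```

With only 20 tasks, even a 7-to-1 split among discordant pairs does not reach p < 0.05 under the exact test, which illustrates the referee's concern about reporting reliability at this benchmark size.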

  3. Referee: [Methods / Metrics] Metric definition: The exact operational definition of “result match rates” (syntactic query match, execution-result equivalence, or semantic equivalence to a ground-truth query) is never stated, nor is the scoring rule applied when schema compliance fails for small models; both omissions are load-bearing for interpreting the accuracy claims.

    Authors: We will explicitly define the metric in the revised Methods section. Result match rate is operationalized as execution-result equivalence: after deterministic translation (where applicable), the query is run on the target codebase and matches if it returns an identical set of code elements to the ground-truth query. Schema compliance failures (primarily for small models) are scored as zero matches since the query cannot execute. We will also specify handling of execution errors or partial results. revision: yes
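The scoring rule described in this response, execution-result equivalence with schema-compliance failures counted as misses, reduces to a small function. A minimal sketch; the element representation and helper names are hypothetical, not the authors' harness.

```python
def score_run(returned, expected) -> int:
    """1 if the executed query returned exactly the ground-truth element set."""
    if returned is None:  # schema violation or execution error: no result at all
        return 0
    return int(set(returned) == set(expected))

def match_rate(runs) -> float:
    """runs: (returned_elements_or_None, expected_elements) pairs."""
    return sum(score_run(r, e) for r, e in runs) / len(runs)

runs = [
    (["a.c:10", "b.c:4"], ["b.c:4", "a.c:10"]),  # order-insensitive match -> 1
    (None, ["x.c:1"]),                           # schema-compliance failure -> 0
    (["a.c:10"], ["a.c:10", "c.c:2"]),           # partial result -> 0
]
print(match_rate(runs))  # 1 of 3 runs matched
```

Set equality makes the metric insensitive to result ordering but strict about partial results, which matches the authors' stated operationalization.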

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports a controlled empirical evaluation of three LLM architectures for natural-language-to-CPGQL translation on a fixed benchmark of 20 tasks. All reported outcomes are measured result-match rates obtained from direct execution against ground-truth queries; no equations, fitted parameters, or predictions are derived from the inputs by construction. The central claim (structured IR outperforming direct and agentic baselines) rests on observed percentages across model scales and repetitions, not on any self-referential definition or load-bearing self-citation. The work contains no derivation chain that reduces to its own inputs, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the 20-task benchmark and result-match metric are valid proxies for real accuracy; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 20 code analysis tasks across three complexity tiers form a representative sample for evaluating query-generation accuracy.
    Invoked to generalize the observed 15–25 point gains beyond the specific benchmark.

pith-pipeline@v0.9.0 · 5523 in / 1353 out tokens · 62927 ms · 2026-05-09T21:08:28.836644+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 16 canonical work pages · 5 internal anchors

  1. [1]

    Abhinav Anand, Shweta Verma, Krishna Narasimhan, and Mira Mezini

  2. [2]

    A Critical Study of What Code-LLMs (Do Not) Learn. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, Bangkok, Thailand, 15869–15889. doi:10.18653/v1/2024.findings-acl.939

  3. [3]

    Anthropic. 2024. Model Context Protocol Specification. (2024). https://modelcontextprotocol.io

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021)

  5. [5]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Liter...

  6. [6]

    Xinyi Hou, Yanjie Zhao, Shenao Wang, and Haoyu Wang. 2025. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv preprint arXiv:2503.23278 (2025)

  7. [7]

    Junze Hu, Xiangyu Jin, Yizhe Zeng, Yuling Liu, Yunpeng Li, Dan Du, Kaiyu Xie, and Hongsong Zhu. 2025. QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration. (2025). arXiv:2506.23644 [cs.SE] https://arxiv.org/abs/2506.23644

  8. [8]

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR abs/1909.09436 (2019). arXiv:1909.09436 http://arxiv.org/abs/1909.09436

  9. [9]

    Sathvik Joel, Jie Wu, and Fatemeh Fard. 2025. A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages. ACM Trans. Softw. Eng. Methodol. (Oct. 2025). doi:10.1145/3770084. Just Accepted

  10. [10]

    Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. 2013. Why Don't Software Developers Use Static Analysis Tools to Find Bugs? In Proc. ICSE. 672–681. doi:10.1109/ICSE.2013.6606613

  11. [11]

    Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin C.C. Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. Advances in Neural Information Pr...

  12. [12]

    Penghui Li, Songchen Yao, Josef Sarfati Korich, Changhua Luo, Jianjia Yu, Yinzhi Cao, and Junfeng Yang. 2025. Automated Static Vulnerability Detection via a Holistic Neuro-symbolic Approach. CoRR abs/2504.16057 (2025). arXiv:2504.16057 doi:10.48550/ARXIV.2504.16057

  13. [13]

    Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. In International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025. 35735–35758. https://proceedings.iclr.cc/paper_files/paper/2025/file/582d4e27fa24168f3af1f4582655034b-Paper-Conference.pdf

  14. [14]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 2155...

  15. [15]

    Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, and Junnan Li. 2025. MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers. (2025). https://openreview.net/forum?id=juQnezS1vw

  16. [16]

    Panagiotis Lymperopoulos and Vasanth Sarathy. 2025. Tools in the Loop: Quantifying Uncertainty of LLM Question Answering Systems That Use Tools. (2025), 2645–2647

  17. [17]

    Marcus Nachtigall, Michael Schlichtig, and Eric Bodden. 2022. A large-scale study of usability criteria addressed by static analysis tools. (2022), 532–543. doi:10.1145/3533767.3534374

  18. [18]

    Krishna Narasimhan. 2024. Bridging Natural Language and Static Analysis. In Proc. BENEVOL.

  19. [19]

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez

  20. [20]

    Gorilla: Large Language Model Connected with Massive APIs. 37 (2024), 126544–126565. doi:10.52202/079017-4020

  21. [21]

    Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 36339–36348. https://proceedings.neurips.cc/paper_files/paper/2023...

  22. [22]

    Caitlin Sadowski, Edward Aftandilian, Alex Eagle, Liam Miller-Cushon, and Ciera Jaspan. 2018. Lessons from Building Static Analysis Tools at Google. Commun. ACM 61, 4 (2018).

  23. [23]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessí, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: language models can teach themselves to use tools. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '23). Curran Associates ...

  24. [24]

    Claire Wang, Ziyang Li, Saikat Dutta, and Mayur Naik. 2025. QLCoder: A Query Synthesizer For Static Analysis of Security Vulnerabilities. arXiv:2511.08462 [cs.CR] https://arxiv.org/abs/2511.08462

  25. [25]

    Le Wang, Chan Chen, Junyi Zhu, Rufeng Zhan, and Weihong Han. 2026. CQLLM: A Framework for Generating CodeQL Security Vulnerability Detection Code Based on Large Language Model. Applied Sciences 16, 1 (2026). doi:10.3390/app16010517

  26. [26]

    Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and Discovering Vulnerabilities with Code Property Graphs. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP '14). IEEE Computer Society, USA, 590–604. doi:10.1109/SP.2014.44

  27. [27]

    Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, and Wenhu Chen. 2026. StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs. (2026). arXiv:2505.20139...

  28. [28]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In 11th International Conference on Learning Representations, ICLR 2023

  29. [29]

    11th International Conference on Learning Representations, ICLR 2023; Conference date: 01-05-2023 through 05-05-2023

  30. [30]

    Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Pro...