Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection
Pith reviewed 2026-05-09 22:03 UTC · model grok-4.3
The pith
A 3+1 multi-agent system deploys three LLM experts in parallel plus a local adversarial verifier to detect code vulnerabilities at low cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that the two-layer game framework produces super-additive value from the three expert perspectives while the adversarial verifier raises precision, yielding a 77.2 percent F1 score, 62.9 percent precision, and 100 percent recall at $0.002 per sample, exceeding both a single-expert LLM baseline and the Cppcheck static analyzer.
What carries the argument
The 3+1 architecture of three DeepSeek-V3 experts running in parallel under a cooperative game, followed by a Qwen3-8B local verifier in an adversarial verification game.
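The described control flow can be sketched as a short pipeline. This is a minimal sketch, not the paper's implementation: `ask_expert` and `verify_locally` are hypothetical stand-ins for the DeepSeek-V3 API and the local Qwen3-8B model, and the placeholder heuristics inside them are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the cloud experts (DeepSeek-V3) and the local
# verifier (Qwen3-8B); the paper's real prompts are not given in this review.
PERSPECTIVES = ["code structure", "security patterns", "debugging logic"]

def ask_expert(perspective: str, code: str) -> bool:
    """Return True if this expert flags the code as vulnerable (stub)."""
    return "strcpy" in code  # placeholder heuristic, not the paper's prompt

def verify_locally(code: str, expert_votes: list[bool]) -> bool:
    """Adversarial verifier: challenges a positive verdict to filter FPs (stub)."""
    return any(expert_votes)  # placeholder: accept if any expert flagged

def detect(code: str) -> bool:
    # The three experts run in parallel; this is what yields the ~3x speedup.
    with ThreadPoolExecutor(max_workers=len(PERSPECTIVES)) as pool:
        votes = list(pool.map(lambda p: ask_expert(p, code), PERSPECTIVES))
    # The verifier only gates positives, so recall is preserved by construction.
    return verify_locally(code, votes) if any(votes) else False
```

The key structural point is that the verifier sits after the experts and can only veto positives, which is why the architecture can trade precision for guaranteed recall.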
If this is right
- Parallel execution of the three experts produces a threefold reduction in wall-clock time.
- The adversarial verifier improves precision by more than 10 percentage points over the experts alone.
- The architecture maintains full recall while operating at a cost of $0.002 per sample, i.e. $2 per thousand samples.
- Game-theoretic separation of cooperative and adversarial roles can be reused for other cost-sensitive code analysis tasks.
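The verifier's precision claim is backed in the abstract by McNemar's test on paired per-sample outcomes. A minimal exact version of that test can be sketched as follows; the counts below are invented for illustration, not the paper's data.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts:
    b = samples correct without the verifier but wrong with it,
    c = samples wrong without the verifier but correct with it."""
    n = b + c
    k = min(b, c)
    # Exact binomial tail under H0 (each discordant pair is a fair coin flip).
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Illustrative counts only: the verifier fixes 30 verdicts and breaks 2,
# which already drives the p-value below the paper's reported 1e-6 threshold.
print(mcnemar_exact(b=2, c=30) < 1e-6)
```

Only the discordant pairs matter, so the test is well suited to the paper's setting where most verdicts are unchanged by the verifier.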
Where Pith is reading between the lines
- The same division of labor may extend to detecting other classes of software defects if new expert perspectives are defined.
- Running the verifier locally could allow integration into continuous-integration pipelines without network latency.
- Performance on proprietary or non-C codebases remains an open question that would require separate evaluation.
Load-bearing premise
The three chosen analysis perspectives and the specific models supply non-redundant information, and this complementarity continues to hold on code samples outside the 262 Juliet test cases.
What would settle it
Applying the same system to a fresh collection of several hundred real-world code samples from open-source projects: recall below 100 percent or an F1 score below 70 percent would show that the reported performance does not generalize.
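Those thresholds can be made concrete: the reported F1 is just the harmonic mean of the reported precision and recall, which is easy to verify, and the same helper would score any fresh evaluation set.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The paper's headline numbers are internally consistent:
# precision 62.9%, recall 100% -> F1 of about 77.2%.
print(round(100 * f1(0.629, 1.0), 1))  # 77.2
```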
read the original abstract
Automated code vulnerability detection is critical for software security, yet existing approaches face a fundamental trade-off between detection accuracy and computational cost. We propose a heterogeneous multi-agent architecture inspired by game-theoretic principles, combining cloud-based LLM experts with a local lightweight verifier. Our "3+1" architecture deploys three cloud-based expert agents (DeepSeek-V3) that analyze code from complementary perspectives - code structure, security patterns, and debugging logic - in parallel, while a local verifier (Qwen3-8B) performs adversarial validation at zero marginal cost. We formalize this design through a two-layer game framework: (1) a cooperative game among experts capturing super-additive value from diverse perspectives, and (2) an adversarial verification game modeling quality assurance incentives. Experiments on 262 real samples from the NIST Juliet Test Suite across 14 CWE types, with balanced vulnerable and benign classes, demonstrate that our approach achieves a 77.2% F1 score with 62.9% precision and 100% recall at $0.002 per sample - outperforming both a single-expert LLM baseline (F1 71.4%) and Cppcheck static analysis (MCC 0). The adversarial verifier significantly improves precision (+10.3 percentage points, p < 1e-6, McNemar's test) by filtering false positives, while parallel execution achieves a 3.0x speedup. Our work demonstrates that game-theoretic design principles can guide effective heterogeneous multi-agent architectures for cost-sensitive software engineering tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a heterogeneous multi-agent system for detecting code vulnerabilities. It employs three specialized LLM experts (DeepSeek-V3) operating in parallel from distinct perspectives—code structure, security patterns, and debugging logic—alongside a local verifier model (Qwen3-8B) for adversarial validation. The design is formalized using a two-layer game-theoretic framework consisting of a cooperative game among experts and an adversarial verification game. Experiments on 262 balanced samples from the NIST Juliet Test Suite across 14 CWE types report an F1 score of 77.2%, precision of 62.9%, and 100% recall at a cost of $0.002 per sample, outperforming a single-expert LLM baseline and Cppcheck static analysis, with the verifier providing a statistically significant precision boost.
Significance. If the central claims regarding the benefits of the heterogeneous architecture and game-theoretic design are substantiated, this work could contribute to cost-effective vulnerability detection by demonstrating how complementary perspectives in LLMs can yield super-additive performance gains while maintaining low computational costs. The inclusion of a statistical test (McNemar's) for the verifier's impact is a positive aspect. However, the current evidence is limited by the narrow evaluation scope and absence of controls for alternative explanations such as prompt engineering effects.
major comments (3)
- Experiments section: The reported performance improvements from the '3+1' architecture lack supporting ablation studies. Specifically, there are no results isolating the individual contributions of the three expert perspectives (code structure, security patterns, debugging logic) or comparing the full setup against a simple majority-vote ensemble of the same three models without the game-theoretic framing.
- Two-layer game framework (Section 3): The two-layer game framework is introduced to formalize the design, but the quantitative results (e.g., the +10.3 percentage point precision improvement from the verifier) are not derived from or predicted by the game equations. The performance metrics appear to be measured post-experiment rather than emerging as consequences of the formal model, raising questions about whether the game theory adds predictive power beyond the empirical setup.
- Evaluation section: All experiments are confined to a balanced 262-sample subset of the NIST Juliet Test Suite covering 14 CWE types. No evaluations on unbalanced real-world codebases, other vulnerability datasets (e.g., Big-Vul or Devign), or cross-project settings are provided, which is critical for assessing whether the claimed generalization of the complementary perspectives holds beyond this controlled test set.
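The second major comment can be stated against the one formal property the framework does pin down. For a coalition value function $v$ over sets of experts, super-additivity is the standard cooperative-game condition (quoted here as the textbook definition, not from the paper):

```latex
% Super-additivity of the experts' coalition value function v
% (standard cooperative-game definition; the paper's concrete v
%  is not reproduced in this review):
v(S \cup T) \;\ge\; v(S) + v(T)
\qquad \text{for all coalitions } S, T \text{ with } S \cap T = \emptyset .
```

The inequality only orders coalition values; it puts no number on the gap, which is consistent with the referee's point that the +10.3-point precision gain is measured rather than derived.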
minor comments (3)
- Abstract and Results: Error bars, standard deviations, or results from multiple runs are not reported for the F1, precision, and recall metrics, making it difficult to assess the stability of the 77.2% F1 score.
- Methodology: Details on the data split (train/test/validation), how the 262 samples were selected, and the exact prompt templates used for the expert agents and verifier are not provided, hindering reproducibility.
- Baselines: The single-expert LLM baseline (F1 71.4%) implementation details are unclear, such as whether it uses the same model and similar prompting effort as the multi-agent system.
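The error-bar gap flagged in the first minor comment could be closed cheaply even without re-running the LLMs, e.g. by bootstrapping the 262 per-sample outcomes. A minimal sketch follows, assuming a list of (predicted, actual) label pairs; the outcome counts below are invented to roughly match the reported operating point (100% recall, ~63% precision on a balanced set), not taken from the paper.

```python
import random

def bootstrap_f1_ci(pairs, n_boot=2000, seed=0):
    """95% percentile CI for F1 over resampled (predicted, actual) pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        tp = sum(1 for p, a in sample if p and a)
        fp = sum(1 for p, a in sample if p and not a)
        fn = sum(1 for p, a in sample if not p and a)
        if 2 * tp + fp + fn == 0:
            continue  # degenerate resample with no positives anywhere
        scores.append(2 * tp / (2 * tp + fp + fn))
    scores.sort()
    return scores[int(0.025 * len(scores))], scores[int(0.975 * len(scores))]

# Invented outcomes: 131 TP, 77 FP, 54 TN, 0 FN on 262 balanced samples.
pairs = [(True, True)] * 131 + [(True, False)] * 77 + [(False, False)] * 54
lo, hi = bootstrap_f1_ci(pairs)
print(lo <= 0.772 <= hi)
```

Reporting such an interval alongside the point estimate would directly answer the stability question about the 77.2% figure.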
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions will be made to improve the manuscript.
read point-by-point responses
-
Referee: Experiments section: The reported performance improvements from the '3+1' architecture lack supporting ablation studies. Specifically, there are no results isolating the individual contributions of the three expert perspectives (code structure, security patterns, debugging logic) or comparing the full setup against a simple majority-vote ensemble of the same three models without the game-theoretic framing.
Authors: We agree that ablation studies are needed to substantiate the contributions of the heterogeneous design. In the revised manuscript, we will add results for each individual expert, all pairwise combinations, the three-expert setup without the verifier, and a majority-vote ensemble of the same three models. These will quantify the incremental value of each perspective and demonstrate that the game-theoretic coordination yields gains beyond simple voting. The existing single-expert baseline (F1 71.4%) already shows improvement, but the new ablations will provide a fuller isolation of effects. revision: yes
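The majority-vote control the referee asks for is a one-line decision rule, which makes this ablation cheap to run against logged expert outputs. A sketch with hypothetical decision rules follows; the paper's exact aggregation is not given in this review, so the "3+1" rule below is an assumption.

```python
def majority_vote(votes: tuple[bool, bool, bool]) -> bool:
    """Plain 2-of-3 ensemble: the control the '3+1' design should beat."""
    return sum(votes) >= 2

def any_plus_verifier(votes, verifier_ok: bool) -> bool:
    """One plausible '3+1' rule (assumption, not stated in the review):
    any expert may flag, and the adversarial verifier may veto."""
    return any(votes) and verifier_ok

# The two rules disagree exactly on 1-of-3 positives, which is where the
# verifier's false-positive filtering (or voting's recall loss) shows up.
print(majority_vote((True, False, False)))            # False: voted out
print(any_plus_verifier((True, False, False), True))  # True: kept, verifier agrees
```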
-
Referee: Two-layer game framework (Section 3): The two-layer game framework is introduced to formalize the design, but the quantitative results (e.g., the +10.3 percentage point precision improvement from the verifier) are not derived from or predicted by the game equations. The performance metrics appear to be measured post-experiment rather than emerging as consequences of the formal model, raising questions about whether the game theory adds predictive power beyond the empirical setup.
Authors: The two-layer framework is primarily a design and explanatory model that motivates the cooperative incentives among experts (leading to super-additive performance) and the adversarial verification game (explaining the precision gain). While the exact numerical values are empirical, the model predicts the qualitative benefit of the verifier in reducing false positives. We will revise Section 3 to more explicitly link the framework to the observed results and clarify its role as a guiding principle rather than a fully predictive quantitative tool, while acknowledging limitations in deriving precise metrics from the equations. revision: partial
-
Referee: Evaluation section: All experiments are confined to a balanced 262-sample subset of the NIST Juliet Test Suite covering 14 CWE types. No evaluations on unbalanced real-world codebases, other vulnerability datasets (e.g., Big-Vul or Devign), or cross-project settings are provided, which is critical for assessing whether the claimed generalization of the complementary perspectives holds beyond this controlled test set.
Authors: The balanced NIST Juliet subset was selected to enable fair, controlled evaluation across 14 CWE types without imbalance effects. We acknowledge the limitation in scope for broader generalization claims. In the revision, we will expand the discussion and limitations sections to address applicability to unbalanced real-world code and other datasets, and outline future work on Big-Vul, Devign, and cross-project settings. However, new large-scale experiments on those resources exceed the scope of this revision. revision: partial
- Deferred to future work: new comprehensive experiments on unbalanced real-world codebases and additional datasets such as Big-Vul or Devign.
Circularity Check
No significant circularity; performance metrics are empirical measurements, not derived predictions
full rationale
The paper describes a heterogeneous multi-agent architecture inspired by game-theoretic principles and states that it formalizes the design via a two-layer game framework (cooperative experts + adversarial verifier). However, the headline results (77.2% F1, precision gains, etc.) are explicitly presented as outcomes of experiments on the 262-sample Juliet suite, not as a priori outputs or predictions computed from the game equations. No equations, uniqueness theorems, or self-citations are shown that would reduce the reported metrics to the framework inputs by construction. Model choices and perspective definitions function as design decisions whose value is validated externally via ablation-free but still independent empirical testing against baselines (single-expert LLM and Cppcheck). The derivation chain therefore remains self-contained: inspiration and formalization guide the architecture, while performance is measured rather than tautologically recovered.
Axiom & Free-Parameter Ledger
free parameters (1)
- Expert perspectives and model assignments
axioms (1)
- domain assumption: Diverse expert perspectives yield super-additive value in a cooperative game
invented entities (1)
- Two-layer game framework (no independent evidence)
Reference graph
Works this paper leans on
- [1]
- [2] Cppcheck Team. 2024. Cppcheck: A Tool for Static C/C++ Code Analysis. https://cppcheck.sourceforge.io/
- [3] DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437 (2024). https://doi.org/10.48550/ARXIV.2412.19437
- [4] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, David Wagner, et al. 2025. Vulnerability Detection with Code Language Models: How Far Are We? In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). https://doi.org/10.1109/ICSE55347.2025.00038 (PrimeVul benchmark)
- [5] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving Factuality and Reasoning in Language Models through Multiagent Debate. In International Conference on Machine Learning (ICML).
- [6]
- [7]
- [8] Sirui Hong et al. 2023. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352 (2023).
- [9] Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. 2013. Why don't software developers use static analysis tools to find bugs? In 2013 35th International Conference on Software Engineering (ICSE). 672–681. https://doi.org/10.1109/icse.2013.6606613
- [10]
- [11] Ahmed Lekssays et al. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. In 34th USENIX Security Symposium.
- [12] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Neural Information Processing Systems (NeurIPS).
- [13] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [14] Steve Morgan. 2020. Cybercrime To Cost The World $10.5 Trillion Annually By 2025. Cybercrime Magazine (2020). Cybersecurity Ventures.
- [15] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V. Vazirani. 2007. Algorithmic Game Theory. Cambridge University Press. https://doi.org/10.1017/CBO9780511800481
- [16] NIST SAMATE. 2017. Juliet Test Suite for C/C++. https://samate.nist.gov/SARD/test-suites/112
- [17] OpenAI et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
- [18] Qwen Team. 2025. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025). https://doi.org/10.48550/ARXIV.2505.09388
- [19]
- [20] Michael Spence. 1973. Job market signaling. The Quarterly Journal of Economics 87, 3 (1973), 355–374. https://doi.org/10.2307/1882010
- [21] Synopsys. 2024. Coverity Static Analysis. https://www.synopsys.com/software-integrity/security-testing/static-analysis-sast.html
- [22]
- [23] Qingyun Wu et al. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv preprint arXiv:2308.08155 (2023).
- [24]
- [25] Yi Xie, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, and Bo Han. 2025. From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium. arXiv preprint arXiv:2506.08292 (2025). Accepted at ICML 2025.
- [26]
- [27]

[Paper-appendix residue truncated in extraction: Table 6 (complete dataset composition by CWE type) and the prompt templates for the expert agents and the adversarial verifier.]
discussion (0)