Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection
Pith reviewed 2026-05-09 22:03 UTC · model grok-4.3
The pith
A 3+1 multi-agent system deploys three LLM experts in parallel plus a local adversarial verifier to detect code vulnerabilities at low cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that the two-layer game framework produces super-additive value from the three expert perspectives while the adversarial verifier raises precision, yielding a 77.2 percent F1 score, 62.9 percent precision, and 100 percent recall at $0.002 per sample, exceeding both a single-expert LLM baseline and the Cppcheck static analyzer.
What carries the argument
The 3+1 architecture of three DeepSeek-V3 experts running in parallel under a cooperative game, followed by a Qwen3-8B local verifier in an adversarial verification game.
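The described control flow can be sketched as a short pipeline. This is a minimal sketch, not the paper's implementation: `ask_expert` and `verify_locally` are hypothetical stand-ins for the DeepSeek-V3 API and the local Qwen3-8B model, and the placeholder heuristics inside them are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the cloud experts (DeepSeek-V3) and the local
# verifier (Qwen3-8B); the paper's real prompts are not given in this review.
PERSPECTIVES = ["code structure", "security patterns", "debugging logic"]

def ask_expert(perspective: str, code: str) -> bool:
    """Return True if this expert flags the code as vulnerable (stub)."""
    return "strcpy" in code  # placeholder heuristic, not the paper's prompt

def verify_locally(code: str, expert_votes: list[bool]) -> bool:
    """Adversarial verifier: challenges a positive verdict to filter FPs (stub)."""
    return any(expert_votes)  # placeholder: accept if any expert flagged

def detect(code: str) -> bool:
    # The three experts run in parallel; this is what yields the ~3x speedup.
    with ThreadPoolExecutor(max_workers=len(PERSPECTIVES)) as pool:
        votes = list(pool.map(lambda p: ask_expert(p, code), PERSPECTIVES))
    # The verifier only gates positives, so recall is preserved by construction.
    return verify_locally(code, votes) if any(votes) else False
```

The key structural point is that the verifier sits after the experts and can only veto positives, which is why the architecture can trade precision for guaranteed recall.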
If this is right
- Parallel execution of the three experts produces a threefold reduction in wall-clock time.
- The adversarial verifier improves precision by more than 10 percentage points over the experts alone.
- The architecture maintains full recall while operating at a cost of $0.002 per sample, i.e. $2 per thousand samples.
- Game-theoretic separation of cooperative and adversarial roles can be reused for other cost-sensitive code analysis tasks.
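The verifier's precision claim is backed in the abstract by McNemar's test on paired per-sample outcomes. A minimal exact version of that test can be sketched as follows; the counts below are invented for illustration, not the paper's data.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts:
    b = samples correct without the verifier but wrong with it,
    c = samples wrong without the verifier but correct with it."""
    n = b + c
    k = min(b, c)
    # Exact binomial tail under H0 (each discordant pair is a fair coin flip).
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Illustrative counts only: the verifier fixes 30 verdicts and breaks 2,
# which already drives the p-value below the paper's reported 1e-6 threshold.
print(mcnemar_exact(b=2, c=30) < 1e-6)
```

Only the discordant pairs matter, so the test is well suited to the paper's setting where most verdicts are unchanged by the verifier.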
Where Pith is reading between the lines
- The same division of labor may extend to detecting other classes of software defects if new expert perspectives are defined.
- Running the verifier locally could allow integration into continuous-integration pipelines without network latency.
- Performance on proprietary or non-C codebases remains an open question that would require separate evaluation.
Load-bearing premise
The three chosen analysis perspectives and the specific models supply non-redundant information, and this complementarity continues to hold on code samples outside the 262 Juliet test cases.
What would settle it
Applying the same system to a fresh collection of several hundred real-world code samples from open-source projects: recall below 100 percent or an F1 score below 70 percent would show that the reported performance does not generalize.
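Those thresholds can be made concrete: the reported F1 is just the harmonic mean of the reported precision and recall, which is easy to verify, and the same helper would score any fresh evaluation set.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The paper's headline numbers are internally consistent:
# precision 62.9%, recall 100% -> F1 of about 77.2%.
print(round(100 * f1(0.629, 1.0), 1))  # 77.2
```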
read the original abstract
Automated code vulnerability detection is critical for software security, yet existing approaches face a fundamental trade-off between detection accuracy and computational cost. We propose a heterogeneous multi-agent architecture inspired by game-theoretic principles, combining cloud-based LLM experts with a local lightweight verifier. Our "3+1" architecture deploys three cloud-based expert agents (DeepSeek-V3) that analyze code from complementary perspectives - code structure, security patterns, and debugging logic - in parallel, while a local verifier (Qwen3-8B) performs adversarial validation at zero marginal cost. We formalize this design through a two-layer game framework: (1) a cooperative game among experts capturing super-additive value from diverse perspectives, and (2) an adversarial verification game modeling quality assurance incentives. Experiments on 262 real samples from the NIST Juliet Test Suite across 14 CWE types, with balanced vulnerable and benign classes, demonstrate that our approach achieves a 77.2% F1 score with 62.9% precision and 100% recall at $0.002 per sample - outperforming both a single-expert LLM baseline (F1 71.4%) and Cppcheck static analysis (MCC 0). The adversarial verifier significantly improves precision (+10.3 percentage points, p < 1e-6, McNemar's test) by filtering false positives, while parallel execution achieves a 3.0x speedup. Our work demonstrates that game-theoretic design principles can guide effective heterogeneous multi-agent architectures for cost-sensitive software engineering tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a heterogeneous multi-agent system for detecting code vulnerabilities. It employs three specialized LLM experts (DeepSeek-V3) operating in parallel from distinct perspectives—code structure, security patterns, and debugging logic—alongside a local verifier model (Qwen3-8B) for adversarial validation. The design is formalized using a two-layer game-theoretic framework consisting of a cooperative game among experts and an adversarial verification game. Experiments on 262 balanced samples from the NIST Juliet Test Suite across 14 CWE types report an F1 score of 77.2%, precision of 62.9%, and 100% recall at a cost of $0.002 per sample, outperforming a single-expert LLM baseline and Cppcheck static analysis, with the verifier providing a statistically significant precision boost.
Significance. If the central claims regarding the benefits of the heterogeneous architecture and game-theoretic design are substantiated, this work could contribute to cost-effective vulnerability detection by demonstrating how complementary perspectives in LLMs can yield super-additive performance gains while maintaining low computational costs. The inclusion of a statistical test (McNemar's) for the verifier's impact is a positive aspect. However, the current evidence is limited by the narrow evaluation scope and absence of controls for alternative explanations such as prompt engineering effects.
major comments (3)
- Experiments section: The reported performance improvements from the '3+1' architecture lack supporting ablation studies. Specifically, there are no results isolating the individual contributions of the three expert perspectives (code structure, security patterns, debugging logic) or comparing the full setup against a simple majority-vote ensemble of the same three models without the game-theoretic framing.
- Two-layer game framework (Section 3): The two-layer game framework is introduced to formalize the design, but the quantitative results (e.g., the +10.3 percentage point precision improvement from the verifier) are not derived from or predicted by the game equations. The performance metrics appear to be measured post-experiment rather than emerging as consequences of the formal model, raising questions about whether the game theory adds predictive power beyond the empirical setup.
- Evaluation section: All experiments are confined to a balanced 262-sample subset of the NIST Juliet Test Suite covering 14 CWE types. No evaluations on unbalanced real-world codebases, other vulnerability datasets (e.g., Big-Vul or Devign), or cross-project settings are provided, which is critical for assessing whether the claimed generalization of the complementary perspectives holds beyond this controlled test set.
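The second major comment can be stated against the one formal property the framework does pin down. For a coalition value function $v$ over sets of experts, super-additivity is the standard cooperative-game condition (quoted here as the textbook definition, not from the paper):

```latex
% Super-additivity of the experts' coalition value function v
% (standard cooperative-game definition; the paper's concrete v
%  is not reproduced in this review):
v(S \cup T) \;\ge\; v(S) + v(T)
\qquad \text{for all coalitions } S, T \text{ with } S \cap T = \emptyset .
```

The inequality only orders coalition values; it puts no number on the gap, which is consistent with the referee's point that the +10.3-point precision gain is measured rather than derived.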
minor comments (3)
- Abstract and Results: Error bars, standard deviations, or results from multiple runs are not reported for the F1, precision, and recall metrics, making it difficult to assess the stability of the 77.2% F1 score.
- Methodology: Details on the data split (train/test/validation), how the 262 samples were selected, and the exact prompt templates used for the expert agents and verifier are not provided, hindering reproducibility.
- Baselines: The single-expert LLM baseline (F1 71.4%) implementation details are unclear, such as whether it uses the same model and similar prompting effort as the multi-agent system.
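The error-bar gap flagged in the first minor comment could be closed cheaply even without re-running the LLMs, e.g. by bootstrapping the 262 per-sample outcomes. A minimal sketch follows, assuming a list of (predicted, actual) label pairs; the outcome counts below are invented to roughly match the reported operating point (100% recall, ~63% precision on a balanced set), not taken from the paper.

```python
import random

def bootstrap_f1_ci(pairs, n_boot=2000, seed=0):
    """95% percentile CI for F1 over resampled (predicted, actual) pairs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        tp = sum(1 for p, a in sample if p and a)
        fp = sum(1 for p, a in sample if p and not a)
        fn = sum(1 for p, a in sample if not p and a)
        if 2 * tp + fp + fn == 0:
            continue  # degenerate resample with no positives anywhere
        scores.append(2 * tp / (2 * tp + fp + fn))
    scores.sort()
    return scores[int(0.025 * len(scores))], scores[int(0.975 * len(scores))]

# Invented outcomes: 131 TP, 77 FP, 54 TN, 0 FN on 262 balanced samples.
pairs = [(True, True)] * 131 + [(True, False)] * 77 + [(False, False)] * 54
lo, hi = bootstrap_f1_ci(pairs)
print(lo <= 0.772 <= hi)
```

Reporting such an interval alongside the point estimate would directly answer the stability question about the 77.2% figure.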
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions will be made to improve the manuscript.
read point-by-point responses
-
Referee: Experiments section: The reported performance improvements from the '3+1' architecture lack supporting ablation studies. Specifically, there are no results isolating the individual contributions of the three expert perspectives (code structure, security patterns, debugging logic) or comparing the full setup against a simple majority-vote ensemble of the same three models without the game-theoretic framing.
Authors: We agree that ablation studies are needed to substantiate the contributions of the heterogeneous design. In the revised manuscript, we will add results for each individual expert, all pairwise combinations, the three-expert setup without the verifier, and a majority-vote ensemble of the same three models. These will quantify the incremental value of each perspective and demonstrate that the game-theoretic coordination yields gains beyond simple voting. The existing single-expert baseline (F1 71.4%) already shows improvement, but the new ablations will provide a fuller isolation of effects. revision: yes
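The majority-vote control the referee asks for is a one-line decision rule, which makes this ablation cheap to run against logged expert outputs. A sketch with hypothetical decision rules follows; the paper's exact aggregation is not given in this review, so the "3+1" rule below is an assumption.

```python
def majority_vote(votes: tuple[bool, bool, bool]) -> bool:
    """Plain 2-of-3 ensemble: the control the '3+1' design should beat."""
    return sum(votes) >= 2

def any_plus_verifier(votes, verifier_ok: bool) -> bool:
    """One plausible '3+1' rule (assumption, not stated in the review):
    any expert may flag, and the adversarial verifier may veto."""
    return any(votes) and verifier_ok

# The two rules disagree exactly on 1-of-3 positives, which is where the
# verifier's false-positive filtering (or voting's recall loss) shows up.
print(majority_vote((True, False, False)))            # False: voted out
print(any_plus_verifier((True, False, False), True))  # True: kept, verifier agrees
```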
-
Referee: Two-layer game framework (Section 3): The two-layer game framework is introduced to formalize the design, but the quantitative results (e.g., the +10.3 percentage point precision improvement from the verifier) are not derived from or predicted by the game equations. The performance metrics appear to be measured post-experiment rather than emerging as consequences of the formal model, raising questions about whether the game theory adds predictive power beyond the empirical setup.
Authors: The two-layer framework is primarily a design and explanatory model that motivates the cooperative incentives among experts (leading to super-additive performance) and the adversarial verification game (explaining the precision gain). While the exact numerical values are empirical, the model predicts the qualitative benefit of the verifier in reducing false positives. We will revise Section 3 to more explicitly link the framework to the observed results and clarify its role as a guiding principle rather than a fully predictive quantitative tool, while acknowledging limitations in deriving precise metrics from the equations. revision: partial
-
Referee: Evaluation section: All experiments are confined to a balanced 262-sample subset of the NIST Juliet Test Suite covering 14 CWE types. No evaluations on unbalanced real-world codebases, other vulnerability datasets (e.g., Big-Vul or Devign), or cross-project settings are provided, which is critical for assessing whether the claimed generalization of the complementary perspectives holds beyond this controlled test set.
Authors: The balanced NIST Juliet subset was selected to enable fair, controlled evaluation across 14 CWE types without imbalance effects. We acknowledge the limitation in scope for broader generalization claims. In the revision, we will expand the discussion and limitations sections to address applicability to unbalanced real-world code and other datasets, and outline future work on Big-Vul, Devign, and cross-project settings. However, new large-scale experiments on those resources exceed the scope of this revision. revision: partial
- Deferred to future work: new comprehensive experiments on unbalanced real-world codebases and additional datasets such as Big-Vul or Devign.
Circularity Check
No significant circularity; performance metrics are empirical measurements, not derived predictions
full rationale
The paper describes a heterogeneous multi-agent architecture inspired by game-theoretic principles and states that it formalizes the design via a two-layer game framework (cooperative experts + adversarial verifier). However, the headline results (77.2% F1, precision gains, etc.) are explicitly presented as outcomes of experiments on the 262-sample Juliet suite, not as a priori outputs or predictions computed from the game equations. No equations, uniqueness theorems, or self-citations are shown that would reduce the reported metrics to the framework inputs by construction. Model choices and perspective definitions function as design decisions whose value is validated externally via ablation-free but still independent empirical testing against baselines (single-expert LLM and Cppcheck). The derivation chain therefore remains self-contained: inspiration and formalization guide the architecture, while performance is measured rather than tautologically recovered.
Axiom & Free-Parameter Ledger
free parameters (1)
- Expert perspectives and model assignments
axioms (1)
- domain assumption: Diverse expert perspectives yield super-additive value in a cooperative game
invented entities (1)
- Two-layer game framework (no independent evidence)
Reference graph
Works this paper leans on
- [1]
- [2] Cppcheck Team. 2024. Cppcheck: A Tool for Static C/C++ Code Analysis. https://cppcheck.sourceforge.io/
- [3] DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437 (2024). https://doi.org/10.48550/ARXIV.2412.19437
- [4] Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, David Wagner, et al. 2025. Vulnerability Detection with Code Language Models: How Far Are We? In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). https://doi.org/10.1109/ICSE55347.2025.00038 (PrimeVul benchmark)
- [5] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving Factuality and Reasoning in Language Models through Multiagent Debate. In International Conference on Machine Learning (ICML).
- [6]
- [7]
- [8] Sirui Hong et al. 2023. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352 (2023).
- [9] Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. 2013. Why don't software developers use static analysis tools to find bugs? In 2013 35th International Conference on Software Engineering (ICSE). 672–681. https://doi.org/10.1109/icse.2013.6606613
- [10]
- [11] Ahmed Lekssays et al. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. In 34th USENIX Security Symposium.
- [12] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. In Neural Information Processing Systems (NeurIPS).
- [13] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [14] Steve Morgan. 2020. Cybercrime To Cost The World $10.5 Trillion Annually By 2025. Cybercrime Magazine (2020). Cybersecurity Ventures.
- [15] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V. Vazirani. 2007. Algorithmic Game Theory. Cambridge University Press. https://doi.org/10.1017/CBO9780511800481
- [16] NIST SAMATE. 2017. Juliet Test Suite for C/C++. https://samate.nist.gov/SARD/test-suites/112
- [17] OpenAI et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
- [18] Qwen Team. 2025. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025). https://doi.org/10.48550/ARXIV.2505.09388
- [19]
- [20] Michael Spence. 1973. Job market signaling. The Quarterly Journal of Economics 87, 3 (1973), 355–374. https://doi.org/10.2307/1882010
- [21] Synopsys. 2024. Coverity Static Analysis. https://www.synopsys.com/software-integrity/security-testing/static-analysis-sast.html
- [22]
- [23] Qingyun Wu et al. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv preprint arXiv:2308.08155 (2023).
- [24]
- [25] Yi Xie, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, and Bo Han. 2025. From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium. arXiv preprint arXiv:2506.08292 (2025). Accepted at ICML 2025.
- [26]
- [27]

[Paper-appendix residue truncated in extraction: Table 6 (complete dataset composition by CWE type) and the prompt templates for the expert agents and the adversarial verifier.]
discussion (0)