pith. machine review for the scientific record.

arXiv:2605.03179 · v1 · submitted 2026-05-04 · 💻 cs.CR · cs.SE

Recognition: 2 theorem links

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Gregory D. Moody, Richard J. Young

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords malicious code · prompt bank · LLM safety · consensus labeling · code generation · security knowledge · refusal evaluation · inter-rater agreement

The pith

Five large language models reach consensus on separating 1,554 prompts for executable malicious code from those for security knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to resolve the mixing of two different kinds of requests in malicious coding benchmarks: those asking for ready-to-run harmful software and those asking for information about security vulnerabilities. A sympathetic reader would care because these two categories may activate different safety mechanisms inside aligned language models, so lumping them together prevents clear measurement of either. The authors operationalize a weapons-versus-knowledge split by having five LLMs from five companies vote on labels for over three thousand prompts drawn from public sources. They report that the process produces a clean 1,554-prompt set labeled as code requests, with very high agreement across the judges. This sets up the distinction as the central way to organize future tests of code safety in language models.

Core claim

The paper demonstrates that a consensus protocol using five large-language-model judges from different vendors can reliably classify prompts into executable code requests versus security knowledge requests. Applied to 3,133 prompts, the three-of-five majority rule produces a 1,554-prompt consensus-CODE bank with Fleiss' kappa of 0.876 and full coverage of all prompts without exclusions. The authors present this validated bank as the primary artifact and argue that treating the weapons-versus-knowledge distinction as the organizing axis allows more precise evaluation of language model safety on malicious code tasks.

What carries the argument

The weapons-versus-knowledge classification axis, carried out through a five-judge consensus protocol where each prompt receives a binary label under a three-of-five majority vote.
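
To make the carrier concrete, here is a minimal Python sketch of the three-of-five rule as described; the votes below are illustrative placeholders, not the released labels.

```python
# Minimal sketch of the paper's three-of-five consensus rule.
# The example votes are hypothetical; only the rule itself
# (binary CODE/KNOWLEDGE labels, 3-of-5 majority) comes from the paper.
from collections import Counter

def consensus_label(votes: list[str]) -> str:
    """Majority label over five binary CODE/KNOWLEDGE votes.

    With five valid binary votes a strict majority always exists,
    which is consistent with the paper's report of zero
    ambiguity-excluded prompts.
    """
    assert len(votes) == 5
    label, count = Counter(votes).most_common(1)[0]
    assert count >= 3  # 3-of-5 threshold
    return label

# Hypothetical per-judge votes for one prompt (a 4/5 agreement tier).
votes = ["CODE", "CODE", "KNOWLEDGE", "CODE", "CODE"]
print(consensus_label(votes))  # -> CODE
```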

Load-bearing premise

That a clean binary distinction between requests for executable malicious code and requests for security knowledge exists and aligns with how safety-aligned models process these inputs.

What would settle it

If a follow-up study finds that the refusal rates of language models on the 1,554 consensus-CODE prompts do not differ meaningfully from those on the 388 consensus-KNOWLEDGE prompts, the practical value of the separation would be undermined.
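
A hedged sketch of how that follow-up comparison could be run, assuming per-prompt refusal verdicts are available: a two-proportion z-test on refusal rates across the two banks. The refusal counts below are invented for illustration; only the group sizes (1,554 and 388) come from the paper.

```python
# Two-proportion z-test comparing refusal rates on the consensus-CODE
# bank (n=1554) versus the consensus-KNOWLEDGE set (n=388).
# The refusal counts are hypothetical placeholders.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(refused_a: int, n_a: int, refused_b: int, n_b: int):
    p_a, p_b = refused_a / n_a, refused_b / n_b
    pooled = (refused_a + refused_b) / (n_a + n_b)          # pooled refusal rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-sided
    return z, p_value

z, p = two_proportion_z(refused_a=900, n_a=1554, refused_b=120, n_b=388)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
# A large rate gap with small p would support the axis; matching
# refusal rates would undermine its practical value.
```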

Figures

Figures reproduced from arXiv:2605.03179 by Gregory D. Moody and Richard J. Young.

Figure 1: Consensus-classification pipeline, read left to right. Four source benchmarks (leftmost column) feed a …
Figure 2: Filtering cascade from four source benchmarks to the released prompt bank. Each row tracks a single …
Figure 3: Agreement-tier distribution across the 3,133 classified prompts. Tiers are ordered from strongest (5/5 …
Figure 4: Per-source inter-rater reliability, with bootstrap 95% confidence intervals. Dotted vertical lines mark the …
Figure 5: Pairwise Cohen's κ between the five judges on the 3,133 classified prompts (only pairs where both judges returned valid CODE/KNOWLEDGE labels are included in each cell; cell counts range from 2,517 to 3,131 depending on per-judge error overlap). Values are symmetric; the diagonal is identity. The coder-specialized pair (GPT-5.3-Codex and Qwen3-Coder-Next) is highlighted. All ten inter-judge pairs exceed th…
Figure 6: Per-judge label distribution across the 3,133 classified prompts. CODE (dark red), KNOWLEDGE (dark …
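
For readers who want to reproduce a cell of Figure 5, a minimal sketch of pairwise Cohen's κ restricted, as the caption specifies, to prompts where both judges returned a valid CODE/KNOWLEDGE label. The toy label lists are illustrative, not the released data.

```python
# Pairwise Cohen's kappa for two judges over binary CODE/KNOWLEDGE labels,
# skipping prompts where either judge returned an invalid label
# (mirroring the per-cell filtering described in the Figure 5 caption).
VALID = {"CODE", "KNOWLEDGE"}

def cohens_kappa(a: list[str], b: list[str]) -> float:
    pairs = [(x, y) for x, y in zip(a, b) if x in VALID and y in VALID]
    n = len(pairs)
    po = sum(x == y for x, y in pairs) / n                  # observed agreement
    pa = sum(x == "CODE" for x, _ in pairs) / n             # judge A's CODE rate
    pb = sum(y == "CODE" for _, y in pairs) / n             # judge B's CODE rate
    pe = pa * pb + (1 - pa) * (1 - pb)                      # chance agreement
    return (po - pe) / (1 - pe)

judge_a = ["CODE", "CODE", "KNOWLEDGE", "CODE", "ERROR", "KNOWLEDGE"]
judge_b = ["CODE", "KNOWLEDGE", "KNOWLEDGE", "CODE", "CODE", "KNOWLEDGE"]
print(f"kappa = {cohens_kappa(judge_a, judge_b):.3f}")  # computed on 5 valid pairs
```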
Original abstract

Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned language models, and a single refusal-rate statistic computed over a mixture cannot isolate either. This paper introduces a weapons-versus-knowledge classification axis, operationalized through a five-model consensus protocol, and applies it to 3,133 prompts drawn from four public benchmarks, yielding a 1,554-prompt consensus-CODE bank (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set used by the companion benchmark paper. The consensus pipeline uses five large-language-model judges spanning five vendor families (Anthropic, OpenAI, Google, Zhipu AI, Alibaba), each issuing a binary CODE/KNOWLEDGE label per prompt under a three-of-five majority rule, with inter-rater reliability quantified by Fleiss' kappa with bootstrap 95% confidence intervals. Across all 3,133 prompts the five judges achieve kappa = 0.876 [95% CI: 0.862, 0.888], "almost perfect" agreement by the Landis & Koch convention, with 69.3% of prompts unanimous at five-of-five; all 3,133 prompts reached the 3-of-5 threshold, so the consensus pipeline produced zero ambiguity-excluded prompts. Whether the axis separates model behavior in practice is an empirical question this paper leaves to the companion benchmark study; the present contribution is the reliability-documented artifact and the case for treating the weapons-versus-knowledge distinction as the organizing axis of code-safety evaluation.
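
The abstract's reliability statistic can be made concrete with a short sketch: Fleiss' kappa over five binary ratings per prompt, plus a percentile-bootstrap 95% CI from resampling prompts, mirroring the reported methodology. The ratings matrix below is a toy with made-up agreement tiers, not the released labels, so it will not reproduce the paper's 0.876.

```python
# Fleiss' kappa for N prompts rated by n judges into two categories,
# with a percentile-bootstrap 95% CI over prompts. Toy data only.
import random

def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i] = [n_CODE, n_KNOWLEDGE] for prompt i; each row sums to n raters."""
    n = sum(counts[0])                                   # raters per prompt
    N = len(counts)
    p = [sum(row[j] for row in counts) / (N * n) for j in range(2)]  # category shares
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                                 # mean per-prompt agreement
    P_e = sum(x * x for x in p)                          # chance agreement
    return (P_bar - P_e) / (1 - P_e)

random.seed(0)
# Hypothetical agreement tiers: 60% unanimous CODE, 20% at 4-of-5, etc.
rows = [[5, 0]] * 60 + [[4, 1]] * 20 + [[1, 4]] * 10 + [[0, 5]] * 10
print(f"kappa = {fleiss_kappa(rows):.3f}")

boot = sorted(fleiss_kappa(random.choices(rows, k=len(rows))) for _ in range(1000))
print(f"bootstrap 95% CI ≈ [{boot[25]:.3f}, {boot[974]:.3f}]")  # percentile method
```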

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims to introduce a weapons-versus-knowledge classification axis for malicious coding prompts and applies a five-LLM consensus protocol (three-of-five majority across judges from five vendors) to 3,133 prompts drawn from four public benchmarks, producing a 1,554-prompt consensus-CODE bank and 388-prompt KNOWLEDGE set. It reports Fleiss' kappa = 0.876 [95% CI 0.862-0.888] with 69.3% unanimous labels and zero prompts excluded for ambiguity, positioning the output as a reliability-documented artifact while deferring behavioral validation of the axis to a companion study.

Significance. The work supplies a large, publicly releasable prompt bank that disentangles executable malicious code requests from security knowledge requests, addressing a recognized conflation in existing refusal benchmarks. The multi-vendor judge design, standard statistical reliability quantification, full prompt coverage, and explicit scoping of claims (no behavioral results here) are clear strengths that make the artifact immediately usable for follow-on safety research.

minor comments (1)
  1. [Title and Abstract] The title says 'Validated Prompt Bank' while the abstract and body consistently describe the contribution as a 'reliability-documented' or 'consensus-labeled' artifact and explicitly defer empirical validation of behavioral separation to the companion paper. Revising the title to 'A Consensus-Labeled Prompt Bank…' would eliminate this minor terminological mismatch without altering the technical claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its strengths in providing a reliability-documented artifact, and the recommendation for minor revision. We have no substantive disagreements with the assessment provided.

Circularity Check

0 steps flagged

No significant circularity in consensus labeling or kappa computation

full rationale

The paper constructs its 1,554-prompt CODE bank by applying a fixed three-of-five majority rule to binary labels from five independent LLM judges spanning distinct vendors, then computes Fleiss' kappa directly from those observed labels. No equations, parameters, or predictions are fitted to the target distinction; the kappa is a pure agreement statistic that, by construction, feeds nothing back into the labels it summarizes. The text contains no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The paper explicitly defers any claim that the axis separates model behavior to a companion study, leaving the present contribution as a self-contained, reliability-documented artifact derived from external prompts and independent judges.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced. The contribution is the empirically labeled dataset. The load-bearing premise is the domain assumption that the binary CODE/KNOWLEDGE axis is meaningful for safety evaluation.

axioms (1)
  • domain assumption: The binary weapons-versus-knowledge distinction is a valid organizing axis for code-safety evaluation
    Invoked as the central classification axis and justification for the consensus protocol throughout the abstract.

pith-pipeline@v0.9.0 · 5616 in / 1520 out tokens · 88680 ms · 2026-05-08T18:06:41.377558+00:00 · methodology

discussion (0)



Reference graph

Works this paper leans on

58 extracted references · 41 canonical work pages · 10 internal anchors

  1. [1] Code Llama: Open Foundation Models for Code
     Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  2. [2] StarCoder: May the Source Be with You!
     Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023

  3. [3] StarCoder 2 and The Stack v2: The Next Generation
     Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024

  4. [4] DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
     Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  5. [5] Qwen2.5-Coder Technical Report
     Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  6. [6] Evaluating Large Language Models Trained on Code
     Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Ponce, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7] Optimal Policy for Software Vulnerability Disclosure
     Ashish Arora, Rahul Telang, and Hao Xu. Optimal policy for software vulnerability disclosure. Management Science, 54(4):642–656, 2008

  8. [8] To Disclose or Not? An Analysis of Software User Behavior
     Dmitri Nizovtsev and Marie Thursby. To disclose or not? An analysis of software user behavior. Information Economics and Policy, 19(1):43–64, 2007

  9. [9] Hunting for Vulnerabilities: Call for European Protection of Security Researchers
     Michal Rampášek, Jozef Andraško, Pavol Sokol, et al. Hunting for vulnerabilities: Call for European protection of security researchers. Journal of Cybersecurity, 2026

  10. [10] RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code
      Jiachi Chen, Qingyuan Zhong, Yanlin Wang, Kaiwen Ning, Yongkun Liu, Zenan Xu, Zhe Zhao, Ting Chen, and Zibin Zheng. RMCBench: Benchmarking large language models' resistance to malicious code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024), 2024

  11. [11] LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
      Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, and Xuelong Li. LLMs caught in the crossfire: Malware requests and jailbreak challenges. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 2025

  12. [12] CySecBench: Generative AI-Based Cybersecurity-Focused Prompt Dataset for Benchmarking Large Language Models
      Johan Wahréus, Ahmed Mohamed Hussain, and Panos Papadimitratos. CySecBench: Generative AI-based cybersecurity-focused prompt dataset for benchmarking large language models. arXiv preprint arXiv:2501.01335, 2025

  13. [13] RedCode: Risky Code Execution and Generation Benchmark for Code Agents
      Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. RedCode: Risky code execution and generation benchmark for code agents. In Advances in Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track, 2024

  14. [14] RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
      Yanlin Wang et al. RealSec-bench: A benchmark for evaluating secure code generation in real-world repositories. arXiv preprint arXiv:2601.22706, 2026

  15. [15] Universal and Transferable Adversarial Attacks on Aligned Language Models
      Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  16. [16] harmful_behaviors dataset
      Maxime Labonne. harmful_behaviors dataset, 2024. HuggingFace dataset, derived from AdvBench

  17. [17] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
      Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in LLMs. arXiv preprint arXiv:2308.13387, 2023

  18. [18] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
      Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  19. [19] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
      Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023

  20. [20] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
      Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598, 2024

  21. [21] A StrongREJECT for Empty Jailbreaks
      Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024

  22. [22] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
      Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  23. [23] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
      Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024

  24. [24] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
      Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024

  25. [25] AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
      Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. AIR-Bench 2024: A safety benchmark based on risk categories from regulations and policies. arXiv preprint arXiv:2407.17436, 2024

  26. [26] MoCha: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
      Muntasir Wahed, Xiaona Zhou, et al. MoCha: Are code language models robust against multi-turn malicious coding prompts? arXiv preprint arXiv:2507.19598, 2025

  27. [27] CyberLLMInstruct: A Pseudo-Malicious Dataset Revealing Safety-Performance Trade-offs in Cyber Security LLM Fine-Tuning
      Adel ElZemity, Budi Arief, and Shujun Li. CyberLLMInstruct: A pseudo-malicious dataset revealing safety-performance trade-offs in cyber security LLM fine-tuning. arXiv preprint arXiv:2503.09334, 2025

  28. [28] Measuring Nominal Scale Agreement Among Many Raters
      Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971

  29. [29] The Measurement of Observer Agreement for Categorical Data
      J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977

  30. [30] Survey Article: Inter-Coder Agreement for Computational Linguistics
      Ron Artstein and Massimo Poesio. Survey article: Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, 2008

  31. [31] Content Analysis: An Introduction to Its Methodology
      Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 1980

  32. [32] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
      Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS 2023), 2023

  33. [33] Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
      Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132, 2024

  34. [34] Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
      Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

  35. [35] PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
      Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023

  36. [36] A Survey on LLM-as-a-Judge
      Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

  37. [37] Red Teaming Language Models with Language Models
      Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022

  38. [38] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
      Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022

  39. [39] MART: Improving LLM Safety with Multi-Round Automatic Red-Teaming
      Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. MART: Improving LLM safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689, 2023

  40. [40] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
      Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

  41. [41] Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
      Richard J. Young. Evaluating the robustness of large language model safety guardrails against adversarial attacks. arXiv preprint arXiv:2511.22047, 2025

  42. [42] Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
      Richard J. Young, Alice Matthews, and Brach Poston. Benchmarking multiple large language models for automated clinical trial data extraction in aging research. Algorithms, 18(5):296, 2025

  43. [43] Annotation Alignment: Comparing LLM and Human Annotations of Conversational Safety
      Rajiv Movva, Pang Wei Koh, and Emma Pierson. Annotation alignment: Comparing LLM and human annotations of conversational safety. arXiv preprint arXiv:2406.06369, 2024

  44. [44] Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K–12 Science Instructional Materials
      Peng He et al. Judging the judges: Human validation of multi-LLM evaluation for high-quality K–12 science instructional materials. arXiv preprint arXiv:2602.13243, 2026

  45. [45] Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale
      Weiyue Li et al. Grading scale impact on LLM-as-a-judge: Human-LLM alignment is highest on 0-5 grading scale. arXiv preprint arXiv:2601.03444, 2026

  46. [46] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
      Richard J. Young. Measuring faithfulness depends on how you measure: Classifier sensitivity in LLM chain-of-thought evaluation. arXiv preprint arXiv:2603.20172, 2026

  47. [47] Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
      Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, et al. Purple Llama CyberSecEval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724, 2023

  48. [48] CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models
      Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Chow, et al. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024

  49. [49] CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
      Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, and Joshua Saxe. CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605, 2024

  50. [50] SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
      Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, and Dawn Song. SeCodePLT: A unified platform for evaluating the security of code GenAI. arXiv preprint arXiv:2410.11096, 2024

  51. [51] Instruction Tuning for Secure Code Generation
      Jingxuan He et al. Instruction tuning for secure code generation. arXiv preprint arXiv:2402.09497, 2024

  52. [52] Smoke and Mirrors: Jailbreaking LLM-Based Code Generation via Implicit Malicious Prompts
      Sheng Ouyang, Yihao Qin, Bo Lin, Liqian Chen, Xiaoguang Mao, and Shangwen Wang. Smoke and mirrors: Jailbreaking LLM-based code generation via implicit malicious prompts. arXiv preprint arXiv:2503.17953, 2025

  53. [53] Qwen3-Coder-Next Technical Report
      Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-Coder-Next technical report. arXiv preprint arXiv:2603.00729, 2026

  54. [54] Model Evaluation for Extreme Risks
      Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324, 2023

  55. [55] Code-Safety Refusal Across Coding-Specialized Language Models: A Behavioral Benchmark
      Richard Young. Code-safety refusal across coding-specialized language models: A behavioral benchmark. Companion paper; manuscript in preparation, 2026

  56. [56] Datasheets for Datasets
      Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021

  57. [57] Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
      Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics (TACL), volume 6, pages 587–604, 2018

  58. [58] Model Cards for Model Reporting
      Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 2019