pith. machine review for the scientific record.

arXiv:2605.03179 · v1 · submitted 2026-05-04 · 💻 cs.CR · cs.SE

Recognition: 2 theorem links

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Gregory D. Moody, Richard J. Young

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:06 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords malicious code · prompt bank · LLM safety · consensus labeling · code generation · security knowledge · refusal evaluation · inter-rater agreement

The pith

Five large language models reach consensus on separating 1,554 prompts for executable malicious code from those for security knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to resolve the mixing of two different kinds of requests in malicious coding benchmarks: those asking for ready-to-run harmful software and those asking for information about security vulnerabilities. A sympathetic reader would care because these two categories may activate different safety mechanisms inside aligned language models, so lumping them together prevents clear measurement of either. The authors operationalize a weapons-versus-knowledge split by having five LLMs from five companies vote on labels for over three thousand prompts drawn from public sources. They report that the process produces a clean 1,554-prompt set labeled as code requests, with very high agreement across the judges. This sets up the distinction as the central way to organize future tests of code safety in language models.

Core claim

The paper demonstrates that a consensus protocol using five large-language-model judges from different vendors can reliably classify prompts into executable code requests versus security knowledge requests. Applied to 3,133 prompts, the three-of-five majority rule produces a 1,554-prompt consensus-CODE bank with Fleiss' kappa of 0.876 and full coverage of all prompts without exclusions. The authors present this validated bank as the primary artifact and argue that treating the weapons-versus-knowledge distinction as the organizing axis allows more precise evaluation of language model safety on malicious code tasks.

What carries the argument

The weapons-versus-knowledge classification axis, carried out through a five-judge consensus protocol where each prompt receives a binary label under a three-of-five majority vote.
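
To make the carrier concrete, here is a minimal Python sketch of the three-of-five rule as described; the votes below are illustrative placeholders, not the released labels.

```python
# Minimal sketch of the paper's three-of-five consensus rule.
# The example votes are hypothetical; only the rule itself
# (binary CODE/KNOWLEDGE labels, 3-of-5 majority) comes from the paper.
from collections import Counter

def consensus_label(votes: list[str]) -> str:
    """Majority label over five binary CODE/KNOWLEDGE votes.

    With five valid binary votes a strict majority always exists,
    which is consistent with the paper's report of zero
    ambiguity-excluded prompts.
    """
    assert len(votes) == 5
    label, count = Counter(votes).most_common(1)[0]
    assert count >= 3  # 3-of-5 threshold
    return label

# Hypothetical per-judge votes for one prompt (a 4/5 agreement tier).
votes = ["CODE", "CODE", "KNOWLEDGE", "CODE", "CODE"]
print(consensus_label(votes))  # -> CODE
```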

Load-bearing premise

That a clean binary distinction between requests for executable malicious code and requests for security knowledge exists and aligns with how safety-aligned models process these inputs.

What would settle it

If a follow-up study finds that the refusal rates of language models on the 1,554 consensus-CODE prompts do not differ meaningfully from those on the 388 consensus-KNOWLEDGE prompts, the practical value of the separation would be undermined.
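
A hedged sketch of how that follow-up comparison could be run, assuming per-prompt refusal verdicts are available: a two-proportion z-test on refusal rates across the two banks. The refusal counts below are invented for illustration; only the group sizes (1,554 and 388) come from the paper.

```python
# Two-proportion z-test comparing refusal rates on the consensus-CODE
# bank (n=1554) versus the consensus-KNOWLEDGE set (n=388).
# The refusal counts are hypothetical placeholders.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(refused_a: int, n_a: int, refused_b: int, n_b: int):
    p_a, p_b = refused_a / n_a, refused_b / n_b
    pooled = (refused_a + refused_b) / (n_a + n_b)          # pooled refusal rate
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-sided
    return z, p_value

z, p = two_proportion_z(refused_a=900, n_a=1554, refused_b=120, n_b=388)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
# A large rate gap with small p would support the axis; matching
# refusal rates would undermine its practical value.
```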

Figures

Figures reproduced from arXiv:2605.03179 by Gregory D. Moody and Richard J. Young.

Figure 1: Consensus-classification pipeline, read left to right. Four source benchmarks (leftmost column) feed a …
Figure 2: Filtering cascade from four source benchmarks to the released prompt bank. Each row tracks a single …
Figure 3: Agreement-tier distribution across the 3,133 classified prompts. Tiers are ordered from strongest (5/5 …
Figure 4: Per-source inter-rater reliability, with bootstrap 95% confidence intervals. Dotted vertical lines mark the …
Figure 5: Pairwise Cohen's κ between the five judges on the 3,133 classified prompts (only pairs where both judges returned valid CODE/KNOWLEDGE labels are included in each cell; cell counts range from 2,517 to 3,131 depending on per-judge error overlap). Values are symmetric; the diagonal is identity. The coder-specialized pair (GPT-5.3-Codex and Qwen3-Coder-Next) is highlighted. All ten inter-judge pairs exceed th…
Figure 6: Per-judge label distribution across the 3,133 classified prompts. CODE (dark red), KNOWLEDGE (dark …
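
For readers who want to reproduce a cell of Figure 5, a minimal sketch of pairwise Cohen's κ restricted, as the caption specifies, to prompts where both judges returned a valid CODE/KNOWLEDGE label. The toy label lists are illustrative, not the released data.

```python
# Pairwise Cohen's kappa for two judges over binary CODE/KNOWLEDGE labels,
# skipping prompts where either judge returned an invalid label
# (mirroring the per-cell filtering described in the Figure 5 caption).
VALID = {"CODE", "KNOWLEDGE"}

def cohens_kappa(a: list[str], b: list[str]) -> float:
    pairs = [(x, y) for x, y in zip(a, b) if x in VALID and y in VALID]
    n = len(pairs)
    po = sum(x == y for x, y in pairs) / n                  # observed agreement
    pa = sum(x == "CODE" for x, _ in pairs) / n             # judge A's CODE rate
    pb = sum(y == "CODE" for _, y in pairs) / n             # judge B's CODE rate
    pe = pa * pb + (1 - pa) * (1 - pb)                      # chance agreement
    return (po - pe) / (1 - pe)

judge_a = ["CODE", "CODE", "KNOWLEDGE", "CODE", "ERROR", "KNOWLEDGE"]
judge_b = ["CODE", "KNOWLEDGE", "KNOWLEDGE", "CODE", "CODE", "KNOWLEDGE"]
print(f"kappa = {cohens_kappa(judge_a, judge_b):.3f}")  # computed on 5 valid pairs
```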
Original abstract

Existing benchmarks of language-model refusal on malicious-coding tasks routinely conflate requests for executable malicious software with requests for harmful security knowledge. This conflation matters because the two request types plausibly trigger distinct refusal pathways in safety-aligned language models, and a single refusal-rate statistic computed over a mixture cannot isolate either. This paper introduces a weapons-versus-knowledge classification axis, operationalized through a five-model consensus protocol, and applies it to 3,133 prompts drawn from four public benchmarks, yielding a 1,554-prompt consensus-CODE bank (the primary released artifact) and a 388-prompt consensus-KNOWLEDGE comparison set used by the companion benchmark paper. The consensus pipeline uses five large-language-model judges spanning five vendor families (Anthropic, OpenAI, Google, Zhipu AI, Alibaba), each issuing a binary CODE/KNOWLEDGE label per prompt under a three-of-five majority rule, with inter-rater reliability quantified by Fleiss' kappa with bootstrap 95% confidence intervals. Across all 3,133 prompts the five judges achieve kappa = 0.876 [95% CI: 0.862, 0.888], "almost perfect" agreement by the Landis & Koch convention, with 69.3% of prompts unanimous at five-of-five; all 3,133 prompts reached the 3-of-5 threshold, so the consensus pipeline produced zero ambiguity-excluded prompts. Whether the axis separates model behavior in practice is an empirical question this paper leaves to the companion benchmark study; the present contribution is the reliability-documented artifact and the case for treating the weapons-versus-knowledge distinction as the organizing axis of code-safety evaluation.
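
The abstract's reliability statistic can be made concrete with a short sketch: Fleiss' kappa over five binary ratings per prompt, plus a percentile-bootstrap 95% CI from resampling prompts, mirroring the reported methodology. The ratings matrix below is a toy with made-up agreement tiers, not the released labels, so it will not reproduce the paper's 0.876.

```python
# Fleiss' kappa for N prompts rated by n judges into two categories,
# with a percentile-bootstrap 95% CI over prompts. Toy data only.
import random

def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i] = [n_CODE, n_KNOWLEDGE] for prompt i; each row sums to n raters."""
    n = sum(counts[0])                                   # raters per prompt
    N = len(counts)
    p = [sum(row[j] for row in counts) / (N * n) for j in range(2)]  # category shares
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                                 # mean per-prompt agreement
    P_e = sum(x * x for x in p)                          # chance agreement
    return (P_bar - P_e) / (1 - P_e)

random.seed(0)
# Hypothetical agreement tiers: 60% unanimous CODE, 20% at 4-of-5, etc.
rows = [[5, 0]] * 60 + [[4, 1]] * 20 + [[1, 4]] * 10 + [[0, 5]] * 10
print(f"kappa = {fleiss_kappa(rows):.3f}")

boot = sorted(fleiss_kappa(random.choices(rows, k=len(rows))) for _ in range(1000))
print(f"bootstrap 95% CI ≈ [{boot[25]:.3f}, {boot[974]:.3f}]")  # percentile method
```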

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims to introduce a weapons-versus-knowledge classification axis for malicious coding prompts and applies a five-LLM consensus protocol (three-of-five majority across judges from five vendors) to 3,133 prompts drawn from four public benchmarks, producing a 1,554-prompt consensus-CODE bank and 388-prompt KNOWLEDGE set. It reports Fleiss' kappa = 0.876 [95% CI 0.862-0.888] with 69.3% unanimous labels and zero prompts excluded for ambiguity, positioning the output as a reliability-documented artifact while deferring behavioral validation of the axis to a companion study.

Significance. The work supplies a large, publicly releasable prompt bank that disentangles executable malicious code requests from security knowledge requests, addressing a recognized conflation in existing refusal benchmarks. The multi-vendor judge design, standard statistical reliability quantification, full prompt coverage, and explicit scoping of claims (no behavioral results here) are clear strengths that make the artifact immediately usable for follow-on safety research.

minor comments (1)
  1. [Title and Abstract] The title says 'Validated Prompt Bank' while the abstract and body consistently describe the contribution as a 'reliability-documented' or 'consensus-labeled' artifact and explicitly defer empirical validation of behavioral separation to the companion paper. Revising the title to 'A Consensus-Labeled Prompt Bank…' would eliminate this minor terminological mismatch without altering the technical claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its strengths in providing a reliability-documented artifact, and the recommendation for minor revision. We have no substantive disagreements with the assessment provided.

Circularity Check

0 steps flagged

No significant circularity in consensus labeling or kappa computation

full rationale

The paper constructs its 1,554-prompt CODE bank by applying a fixed three-of-five majority rule to binary labels from five independent LLM judges spanning distinct vendors, then computes Fleiss' kappa directly from those observed labels. No equations, parameters, or predictions are fitted to the target distinction; the kappa is a pure agreement statistic that, by construction, feeds nothing back into the labels it summarizes. The text contains no load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work. The paper explicitly defers any claim that the axis separates model behavior to a companion study, leaving the present contribution as a self-contained, reliability-documented artifact derived from external prompts and independent judges.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are introduced. The contribution is the empirically labeled dataset. The load-bearing premise is the domain assumption that the binary CODE/KNOWLEDGE axis is meaningful for safety evaluation.

axioms (1)
  • domain assumption: The binary weapons-versus-knowledge distinction is a valid organizing axis for code-safety evaluation
    Invoked as the central classification axis and justification for the consensus protocol throughout the abstract.

pith-pipeline@v0.9.0 · 5616 in / 1520 out tokens · 88680 ms · 2026-05-08T18:06:41.377558+00:00 · methodology

discussion (0)



Reference graph

Works this paper leans on

58 extracted references · 41 canonical work pages · 10 internal anchors

  1. [1] Code Llama: Open Foundation Models for Code
     Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  2. [2] StarCoder: May the Source Be with You!
     Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023

  3. [3] StarCoder 2 and The Stack v2: The Next Generation
     Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. StarCoder 2 and The Stack v2: The next generation. arXiv preprint arXiv:2402.19173, 2024

  4. [4] DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence
     Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  5. [5] Qwen2.5-Coder Technical Report
     Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  6. [6] Evaluating Large Language Models Trained on Code
     Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Ponce, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7] Optimal Policy for Software Vulnerability Disclosure
     Ashish Arora, Rahul Telang, and Hao Xu. Optimal policy for software vulnerability disclosure. Management Science, 54(4):642–656, 2008

  8. [8] To Disclose or Not? An Analysis of Software User Behavior
     Dmitri Nizovtsev and Marie Thursby. To disclose or not? An analysis of software user behavior. Information Economics and Policy, 19(1):43–64, 2007

  9. [9] Hunting for Vulnerabilities: Call for European Protection of Security Researchers
     Michal Rampášek, Jozef Andraško, Pavol Sokol, et al. Hunting for vulnerabilities: Call for European protection of security researchers. Journal of Cybersecurity, 2026

  10. [10] RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code
      Jiachi Chen, Qingyuan Zhong, Yanlin Wang, Kaiwen Ning, Yongkun Liu, Zenan Xu, Zhe Zhao, Ting Chen, and Zibin Zheng. RMCBench: Benchmarking large language models' resistance to malicious code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024), 2024

  11. [11] LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
      Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, and Xuelong Li. LLMs caught in the crossfire: Malware requests and jailbreak challenges. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), 2025

  12. [12] CySecBench: Generative AI-Based Cybersecurity-Focused Prompt Dataset for Benchmarking Large Language Models
      Johan Wahréus, Ahmed Mohamed Hussain, and Panos Papadimitratos. CySecBench: Generative AI-based cybersecurity-focused prompt dataset for benchmarking large language models. arXiv preprint arXiv:2501.01335, 2025

  13. [13] RedCode: Risky Code Execution and Generation Benchmark for Code Agents
      Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. RedCode: Risky code execution and generation benchmark for code agents. In Advances in Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track, 2024

  14. [14] RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories
      Yanlin Wang et al. RealSec-bench: A benchmark for evaluating secure code generation in real-world repositories. arXiv preprint arXiv:2601.22706, 2026

  15. [15] Universal and Transferable Adversarial Attacks on Aligned Language Models
      Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  16. [16] harmful_behaviors dataset
      Maxime Labonne. harmful_behaviors dataset, 2024. HuggingFace dataset, derived from AdvBench

  17. [17] Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
      Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in LLMs. arXiv preprint arXiv:2308.13387, 2023

  18. [18] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
      Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023

  19. [19] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
      Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657, 2023

  20. [20] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
      Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. SORRY-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598, 2024

  21. [21] A StrongREJECT for Empty Jailbreaks
      Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for empty jailbreaks. arXiv preprint arXiv:2402.10260, 2024

  22. [22] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
      Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

  23. [23] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
      Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv preprint arXiv:2404.01318, 2024

  24. [24] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
      Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044, 2024

  25. [25] AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
      Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. AIR-Bench 2024: A safety benchmark based on risk categories from regulations and policies. arXiv preprint arXiv:2407.17436, 2024

  26. [26] MoCha: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?
      Muntasir Wahed, Xiaona Zhou, et al. MoCha: Are code language models robust against multi-turn malicious coding prompts? arXiv preprint arXiv:2507.19598, 2025

  27. [27] CyberLLMInstruct: A Pseudo-Malicious Dataset Revealing Safety-Performance Trade-offs in Cyber Security LLM Fine-Tuning
      Adel ElZemity, Budi Arief, and Shujun Li. CyberLLMInstruct: A pseudo-malicious dataset revealing safety-performance trade-offs in cyber security LLM fine-tuning. arXiv preprint arXiv:2503.09334, 2025

  28. [28] Measuring Nominal Scale Agreement Among Many Raters
      Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971

  29. [29] The Measurement of Observer Agreement for Categorical Data
      J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977

  30. [30] Survey Article: Inter-Coder Agreement for Computational Linguistics
      Ron Artstein and Massimo Poesio. Survey article: Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, 2008

  31. [31] Content Analysis: An Introduction to Its Methodology
      Klaus Krippendorff. Content Analysis: An Introduction to Its Methodology. Sage Publications, 1980

  32. [32] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
      Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS 2023), 2023

  33. [33] Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
      Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N. Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv preprint arXiv:2403.04132, 2024

  34. [34] Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
      Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

  35. [35] PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
      Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087, 2023

  36. [36] A Survey on LLM-as-a-Judge
      Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

  37. [37] Red Teaming Language Models with Language Models
      Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022

  38. [38] Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
      Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022

  39. [39] MART: Improving LLM Safety with Multi-Round Automatic Red-Teaming
      Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. MART: Improving LLM safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689, 2023

  40. [40] Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
      Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

  41. [41] Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks
      Richard J. Young. Evaluating the robustness of large language model safety guardrails against adversarial attacks. arXiv preprint arXiv:2511.22047, 2025

  42. [42] Benchmarking Multiple Large Language Models for Automated Clinical Trial Data Extraction in Aging Research
      Richard J. Young, Alice Matthews, and Brach Poston. Benchmarking multiple large language models for automated clinical trial data extraction in aging research. Algorithms, 18(5):296, 2025

  43. [43] Annotation Alignment: Comparing LLM and Human Annotations of Conversational Safety
      Rajiv Movva, Pang Wei Koh, and Emma Pierson. Annotation alignment: Comparing LLM and human annotations of conversational safety. arXiv preprint arXiv:2406.06369, 2024

  44. [44] Judging the Judges: Human Validation of Multi-LLM Evaluation for High-Quality K–12 Science Instructional Materials
      Peng He et al. Judging the judges: Human validation of multi-LLM evaluation for high-quality K–12 science instructional materials. arXiv preprint arXiv:2602.13243, 2026

  45. [45] Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale
      Weiyue Li et al. Grading scale impact on LLM-as-a-judge: Human-LLM alignment is highest on 0-5 grading scale. arXiv preprint arXiv:2601.03444, 2026

  46. [46] Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
      Richard J. Young. Measuring faithfulness depends on how you measure: Classifier sensitivity in LLM chain-of-thought evaluation. arXiv preprint arXiv:2603.20172, 2026

  47. [47] Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
      Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, et al. Purple Llama CyberSecEval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724, 2023

  48. [48] CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models
      Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Chow, et al. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024

  49. [49] CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models
      Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, and Joshua Saxe. CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605, 2024

  50. [50] SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
      Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, and Dawn Song. SeCodePLT: A unified platform for evaluating the security of code GenAI. arXiv preprint arXiv:2410.11096, 2024

  51. [51] Instruction Tuning for Secure Code Generation
      Jingxuan He et al. Instruction tuning for secure code generation. arXiv preprint arXiv:2402.09497, 2024

  52. [52] Smoke and Mirrors: Jailbreaking LLM-Based Code Generation via Implicit Malicious Prompts
      Sheng Ouyang, Yihao Qin, Bo Lin, Liqian Chen, Xiaoguang Mao, and Shangwen Wang. Smoke and mirrors: Jailbreaking LLM-based code generation via implicit malicious prompts. arXiv preprint arXiv:2503.17953, 2025

  53. [53] Qwen3-Coder-Next Technical Report
      Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-Coder-Next technical report. arXiv preprint arXiv:2603.00729, 2026

  54. [54] Model Evaluation for Extreme Risks
      Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, et al. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324, 2023

  55. [55] Code-Safety Refusal Across Coding-Specialized Language Models: A Behavioral Benchmark
      Richard Young. Code-safety refusal across coding-specialized language models: A behavioral benchmark. Companion paper; manuscript in preparation, 2026

  56. [56] Datasheets for Datasets
      Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 2021

  57. [57] Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science
      Emily M. Bender and Batya Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics (TACL), volume 6, pages 587–604, 2018

  58. [58] Model Cards for Model Reporting
      Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 2019