pith. sign in

arxiv: 2605.20351 · v1 · pith:XE5HXYL2new · submitted 2026-05-19 · 💻 cs.CR

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

Pith reviewed 2026-05-21 07:34 UTC · model grok-4.3

classification 💻 cs.CR
keywords refusal evaluationcoding LLMsmalicious code promptsprompt corporasystematic reviewjailbreak evaluationmalware taxonomy
0
0 comments X

The pith

A review of thirteen malicious-code prompt corpora for coding LLMs identifies three recurring methodological gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines thirteen publicly released collections of prompts built to test how coding large language models and agents refuse requests involving malicious code. It applies a uniform template to compare each corpus on construction methods, taxonomy of prompts, reproducibility, licensing, and coverage of malware types. The synthesis identifies three recurring gaps: missing human-annotator baselines to calibrate automated judgments, refusal-rate measures that track different constructs and cannot be compared directly, and incompatible malware-category schemes with no shared schema across the sets. A sympathetic reader would care because these gaps prevent clear comparisons of safety performance across models and slow progress on reliable refusal evaluation.

Core claim

By treating the prompt datasets themselves as the unit of analysis rather than code security or jailbreak taxonomy, the review shows that the thirteen corpora exhibit three recurring methodological gaps: the absence of human-annotator baselines against which LLM-judge labels can be calibrated, the absence of cross-corpus comparability with refusal-rate statistics measuring non-equivalent constructs, and the fragmentation of malware-category taxonomies with no canonical schema spanning the thirteen corpora.

What carries the argument

The uniform extraction template applied across all in-scope corpora to synthesize their construction methodology, prompt-construction taxonomy, reproducibility and licensing details, and malware-category coverage.

If this is right

  • Future corpora could adopt pre-registration of inclusion criteria to reduce fragmentation.
  • Validation could draw on multiple independent judges to strengthen label calibration.
  • Reliability reporting could standardize on statistical baselines with confidence intervals.
  • A shared canonical taxonomy for malware categories could support direct cross-corpus comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Closing the gaps would let safety teams compare refusal rates across different coding models on equivalent terms.
  • A unified taxonomy might extend to refusal testing in non-code domains such as general text or other agent tasks.
  • Adopting the proposed directions could reduce the current scatter in reported refusal statistics.

Load-bearing premise

The assumption that the search strategy and screening process comprehensively identified all relevant corpora released between 2023 and 2025 without significant omissions from publication bias or incomplete database coverage.

What would settle it

Locating one additional corpus released in the 2023-2025 window that the search missed, or finding that one of the thirteen already supplies calibrated human baselines together with a malware taxonomy shared with the others, would challenge the claim that these gaps recur across the full set.

Figures

Figures reproduced from arXiv: 2605.20351 by Gregory D. Moody, Richard J. Young.

Figure 1
Figure 1. Figure 1: Timeline of the thirteen in-scope malicious-code-generation prompt corpora plotted at their arXiv submission [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Scope boundary map for the systematic review. The two axes are the evaluation construct (x: refusal vs. [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Prompt-construction taxonomy across the thirteen in-scope corpora. The figure positions each corpus by [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Coverage of malware categories across the thirteen in-scope corpora, in chronological release order (left to [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Reproducibility and Reliability Gap Matrix for the thirteen in-scope malicious-code-generation prompt [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: PRISMA 2020 flow diagram for the systematic review of malicious-code-generation prompt corpora. Phase [PITH_FULL_IMAGE:figures/full_fig_p030_6.png] view at source ↗
read the original abstract

The evaluation of large language model refusal on malicious-coding tasks now spans at least thirteen publicly released prompt corpora (AdvBench, the CyberSecEval family, RMCBench, RedCode, MCGMark, JailbreakBench, CySecBench, MalwareBench, CIRCLE, MOCHA, ASTRA, Scam2Prompt / Innoc2Scam-bench, and JAWS-Bench), each constructed under a different protocol, released under different licensing terms, and validated (or not) against different inter-rater reliability standards. Existing surveys treat code security, jailbreak taxonomy, or vulnerability detection as the central object and mention these corpora only in passing. This paper reverses that framing: it treats the prompt datasets themselves as the unit of analysis. Following a PRISMA-style protocol, we specify a search strategy, screen the recent literature on coding-LLM refusal evaluation, apply a uniform extraction template to each in-scope corpus, and synthesize the resulting catalogue along construction methodology, prompt-construction taxonomy (modality, turn structure, elicitation style), reproducibility and licensing, and malware-category coverage. The synthesis surfaces three recurring methodological gaps: the absence of human-annotator baselines against which LLM-judge labels can be calibrated, the absence of cross-corpus comparability with refusal-rate statistics measuring non-equivalent constructs, and the fragmentation of malware-category taxonomies, with no canonical schema spanning the thirteen in-scope corpora. The review concludes with proposed methodological directions for next-generation corpora, including pre-registration of inclusion criteria, vendor-diverse multi-judge validation, Fleiss' kappa with bootstrap CI as the reliability baseline, and a candidate canonical taxonomy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a systematic review of thirteen publicly released prompt corpora (AdvBench, CyberSecEval family, RMCBench, RedCode, MCGMark, JailbreakBench, CySecBench, MalwareBench, CIRCLE, MOCHA, ASTRA, Scam2Prompt/Innoc2Scam-bench, JAWS-Bench) developed 2023-2025 for evaluating refusal on malicious coding tasks in LLMs and code agents. Following a PRISMA-style protocol with uniform extraction on construction methodology, prompt taxonomy (modality, turn structure, elicitation style), reproducibility/licensing, and malware-category coverage, the paper synthesizes recurring methodological gaps: absence of human-annotator baselines for calibrating LLM-judge labels, lack of cross-corpus comparability because refusal-rate statistics measure non-equivalent constructs, and fragmentation of malware taxonomies with no canonical schema spanning the corpora. It concludes with proposed directions including pre-registration of inclusion criteria, vendor-diverse multi-judge validation, Fleiss' kappa with bootstrap CI as reliability baseline, and a candidate canonical taxonomy.

Significance. If the corpus identification is comprehensive and the extracted gaps accurately characterize the literature, the review provides a useful reference catalogue that reverses the typical framing (focusing on datasets rather than models) and could reduce fragmentation in LLM safety evaluation for code. The uniform extraction template and explicit proposals for reliability metrics and taxonomy standardization are concrete strengths that support reproducibility and future corpus design.

major comments (1)
  1. [Methods] Methods (PRISMA protocol description): the central claim that the three gaps recur across the full set of relevant corpora and that no canonical schema spans the literature depends on exhaustive identification of the 2023-2025 corpora. The manuscript states that a search strategy is specified and screening applied, but explicit query strings, list of databases, operational definition of 'coding-LLM refusal evaluation,' and the PRISMA flow diagram with numbers screened/excluded are required to quantify omission risk from publication bias or incomplete indexing.
minor comments (2)
  1. [Abstract] Abstract and §4 (synthesis): the list of thirteen corpora would be clearer if presented in a summary table with columns for release year, licensing, and validation method.
  2. [Conclusion] §5 (proposed directions): the candidate canonical taxonomy is referenced but not illustrated; an example schema or mapping to existing corpora would strengthen the recommendation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our systematic review. The comment highlights an important area for improving methodological transparency, and we will revise the manuscript to address it directly.

read point-by-point responses
  1. Referee: [Methods] Methods (PRISMA protocol description): the central claim that the three gaps recur across the full set of relevant corpora and that no canonical schema spans the literature depends on exhaustive identification of the 2023-2025 corpora. The manuscript states that a search strategy is specified and screening applied, but explicit query strings, list of databases, operational definition of 'coding-LLM refusal evaluation,' and the PRISMA flow diagram with numbers screened/excluded are required to quantify omission risk from publication bias or incomplete indexing.

    Authors: We agree that the current description of the search strategy and screening process is insufficiently detailed for full reproducibility and assessment of coverage. In the revised version we will add: (1) the exact query strings employed in each database, (2) the complete list of sources searched (arXiv, ACL Anthology, Hugging Face datasets, GitHub repositories, and selected conference proceedings), (3) an explicit operational definition of 'coding-LLM refusal evaluation' that was used to determine inclusion, and (4) a PRISMA flow diagram reporting the numbers of records identified, screened, excluded (with reasons), and finally included. These additions will allow readers to evaluate the risk of omitted corpora and will strengthen the foundation for the three recurring gaps we identify. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive systematic review with no derivations or self-referential claims

full rationale

This is a cataloging review that applies a PRISMA-style protocol to identify and extract features from thirteen existing corpora, then observes three recurring gaps in the published literature. No equations, fitted parameters, predictions, or first-principles derivations exist that could reduce to inputs by construction. Claims rest on direct synthesis of external sources rather than self-definition, self-citation chains, or renaming of known results. The central synthesis is therefore self-contained as an empirical survey.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the completeness of the literature search and the accuracy of feature extraction from the corpora descriptions; no free parameters or invented entities are introduced.

axioms (2)
  • standard math A PRISMA-style protocol provides a valid and reproducible framework for screening and synthesizing literature on prompt corpora.
    The paper explicitly follows this established systematic review method as described in the abstract.
  • domain assumption The thirteen listed corpora (AdvBench, CyberSecEval family, etc.) constitute the relevant publicly released prompt sets for malicious-code refusal evaluation from 2023-2025.
    The scope and synthesis depend on this coverage being representative.

pith-pipeline@v0.9.0 · 5840 in / 1643 out tokens · 61199 ms · 2026-05-21T07:34:28.636271+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Following a PRISMA-style systematic-review protocol, this paper specifies a search strategy, screens the recent literature on coding-LLM refusal evaluation, applies a uniform extraction template to each in-scope corpus, and synthesizes the resulting catalogue along four dimensions.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 11 internal anchors

  1. [1]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. Re- leased datasets: harmful_strings (500 items) and harmful_behaviors (500 items in the original pa- per; the widely-redistributed Hugging Face version cu...

  2. [2]

    Purple llama cyberseceval: A secure coding benchmark for language models,

    Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, et al. Purple Llama CyberSecEval: A secure coding benchmark for language models.arXiv preprint arXiv:2312.04724, 2023

  3. [3]

    CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

    Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cor- nelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models.arXiv preprint arXiv:2404.13161, 2024

  4. [4]

    CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models.arXiv preprint arXiv:2408.01605, 2024

    Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, and Joshua Saxe. CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models.arXiv preprint arXiv:2408.01605, 2024

  5. [5]

    RMCBench: Benchmarking large language models’ resistance to malicious code

    Jiachi Chen, Qingyuan Zhong, Yanlin Wang, Kaiwen Ning, Yongkun Liu, Zenan Xu, Zhe Zhao, Ting Chen, and Zibin Zheng. RMCBench: Benchmarking large language models’ resistance to malicious code. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024), 2024

  6. [6]

    RedCode: Risky code execution and generation benchmark for code agents

    Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. RedCode: Risky code execution and generation benchmark for code agents. InAdvances in Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track, 2024

  7. [8]

    Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. InAdvances in Neural Information Processing Systems 38 (NeurIPS 2...

  8. [9]

    CySecBench: Generative AI-based cybersecurity-focused prompt dataset for benchmarking large language models.arXiv preprint arXiv:2501.01335, 2025

    Johan Wahréus, Ahmed Mohamed Hussain, and Panos Papadimitratos. CySecBench: Generative AI-based cybersecurity-focused prompt dataset for benchmarking large language models.arXiv preprint arXiv:2501.01335, 2025

  9. [10]

    LLMs caught in the cross- fire: Malware requests and jailbreak challenges

    Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, and Xuelong Li. LLMs caught in the cross- fire: Malware requests and jailbreak challenges. InProceedings of the 63rd Annual Meeting of the Associa- 24 Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review tion for Computational Linguistics (Volume 1: Long Papers), pages 27833–27848...

  10. [11]

    Running in CIRCLE? a simple benchmark for LLM code interpreter security, 2025

    Gabriel Chua. Running in CIRCLE? a simple benchmark for LLM code interpreter security, 2025. arXiv:2507.19399

  11. [12]

    Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani- Tür, and Ismini Lourentzou

    Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani- Tür, and Ismini Lourentzou. MOCHA: Are code language models robust against multi-turn malicious coding prompts? InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 22922–22948,

  12. [13]

    Test sets released under CC BY-NC 4.0 at huggingface.co/datasets/purpcode/mocha (gated); full training data available on request from the authors (mwahed2@illinois.edu)

    Winner Defender Team at Amazon Nova AI Challenge 2025. Test sets released under CC BY-NC 4.0 at huggingface.co/datasets/purpcode/mocha (gated); full training data available on request from the authors (mwahed2@illinois.edu)

  13. [14]

    ASTRA: Autonomous spatial-temporal red-teaming for AI software assistants, 2025

    Xiangzhe Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo, Lu Yan, Xuan Chen, Jiasheng Jiang, Xiaolong Jin, Chengpeng Wang, Zhuo Zhang, and Xiangyu Zhang. ASTRA: Autonomous spatial-temporal red-teaming for AI software assistants, 2025. arXiv:2508.03936; released benchmark: PurCL/astra-agent-security (1,995 prompts)

  14. [15]

    Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

    Zhiyang Chen, Tara Saba, Xun Deng, Xujie Si, and Fan Long. Scam2Prompt: A scalable framework for auditing malicious scam endpoints in production LLMs, 2025. arXiv:2509.02372; releases Innoc2Scam-bench, 1,559 innocuous developer prompts

  15. [16]

    Breaking the code: Security assessment of AI code agents through systematic jailbreaking attacks, 2025

    Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, and Varun Kumar. Breaking the code: Security assessment of AI code agents through systematic jailbreaking attacks, 2025. arXiv:2510.01359; releases JAWS-Bench: 182 + 100 + 182 prompts/codebases across empty, single-file, and multi-file workspace regimes

  16. [17]

    From vulnerabilities to remediation: A systematic literature review of LLMs in code security, 2024

    Enna Basic and Alberto Giaretta. From vulnerabilities to remediation: A systematic literature review of LLMs in code security, 2024. arXiv:2412.15004

  17. [18]

    Jailbreaking LLMs: A survey of attacks, defenses and evaluation, 2026

    Safayat Bin Hakim, Kanchon Gharami, Nahid Farhady Ghalaty, and Shafika Showkat. Jailbreaking LLMs: A survey of attacks, defenses and evaluation, 2026. TechRxiv preprint

  18. [19]

    LLMs in software security: A survey of vulnerability detection techniques and insights, 2025

    Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. LLMs in software security: A survey of vulnerability detection techniques and insights, 2025. arXiv:2502.07049; preprint as of search cutoff

  19. [20]

    A Survey on Large Language Models for Code Generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation.ACM Transactions on Software Engineering and Methodology, 2025. arXiv:2406.00515

  20. [21]

    A survey on agentic security: Applications, threats and defenses, 2025

    Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, and Md Rizwan Parvez. A survey on agentic security: Applications, threats and defenses, 2025. arXiv:2510.06445

  21. [22]

    A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    Richard J. Young and Gregory D. Moody. A validated prompt bank for malicious code generation: Separating exe- cutable weapons from security knowledge in 1,554 consensus-labeled prompts.arXiv preprint arXiv:2605.03179, 2026

  22. [23]

    Joseph L. Fleiss. Measuring nominal scale agreement among many raters.Psychological Bulletin, 76(5):378–382, 1971

  23. [24]

    Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

    Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models.arXiv preprint arXiv:2404.18796, 2024

  24. [25]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

  25. [26]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

  26. [27]

    Is your ai-generated code really safe? evaluating large language models on secure code generation with CodeSecEval, 2024

    Jiexin Wang and Xitong Luo. Is your ai-generated code really safe? evaluating large language models on secure code generation with CodeSecEval, 2024. arXiv:2407.02395

  27. [28]

    SecRepoBench: Benchmarking code agents for secure code completion in real-world repositories, 2025

    Chihao Shen and Connor Dilgren. SecRepoBench: Benchmarking code agents for secure code completion in real-world repositories, 2025. arXiv:2504.21205

  28. [29]

    RealSec-bench: A benchmark for evaluating secure code generation in real-world repositories, 2026

    Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, and Zibin Zheng. RealSec-bench: A benchmark for evaluating secure code generation in real-world repositories, 2026. arXiv:2601.22706. 25 Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review

  29. [30]

    ProSec: Fortifying code LLMs with proactive security alignment, 2024

    Xiangzhe Xu and Zian Su. ProSec: Fortifying code LLMs with proactive security alignment, 2024. arXiv:2411.12882

  30. [31]

    SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

    Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, and David Lo. SecureVibeBench: Benchmarking secure vibe coding of AI agents via reconstructing vulnerability-introducing scenarios, 2026. arXiv:2509.22097; originally released as SecureAgentBench, subsequently revised...

  31. [32]

    The WMDP benchmark: Measuring and reducing malicious use with unlearning,

    Nathaniel Li and Alexander Pan. The WMDP benchmark: Measuring and reducing malicious use with unlearning,

  32. [33]

    SG-Bench: Evaluating LLM safety generalization across diverse tasks and prompt types

    Yutao Mou, Shikun Zhang, and Wei Ye. SG-Bench: Evaluating LLM safety generalization across diverse tasks and prompt types. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. arXiv:2410.21965

  33. [34]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. arXiv:2406.18495

  34. [35]

    CIBER: A comprehensive benchmark for security evaluation of code interpreter agents, 2026

    Lei Ba, Qinbin Li, and Songze Li. CIBER: A comprehensive benchmark for security evaluation of code interpreter agents, 2026. arXiv:2602.19547; out of scope (capability/agent-robustness rather than refusal)

  35. [36]

    URLhttps://openreview.net/forum?id=VTF8yNQM66

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated benchmarking of LLM agents on real-world software security tasks, 2025. arXiv:2506.11791; out of scope (capability/security- engineering rather than refusal)

  36. [37]

    Richard Landis and Gary G

    J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, 1977

  37. [38]

    Annotation alignment: Comparing LLM and human annotations of conversational safety.arXiv preprint arXiv:2406.06369, 2024

    Rajiv Movva, Pang Wei Koh, and Emma Pierson. Annotation alignment: Comparing LLM and human annotations of conversational safety.arXiv preprint arXiv:2406.06369, 2024

  38. [39]

    Young and Gregory D

    Richard J. Young and Gregory D. Moody. A multi-corpus empirical re-evaluation of refusal-label reliability across twelve malicious-code prompt corpora. Manuscript in preparation. Companion empirical paper to the present systematic review. Applies the five-judge consensus protocol of [21] to all thirteen in-scope corpora and reports Fleiss’ kappa with boot...

  39. [40]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems, 2023. arXiv:2307.02483

  40. [41]

    BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023. arXiv:2307.04657

  41. [42]

    Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, and Christopher D. Manning. h4rm3l: A language for composable jailbreak attack synthesis, 2024. arXiv:2408.04811; ICLR 2025

  42. [43]

    SecCodePLT: A unified benchmark for evaluating the security risks and capabilities of code agents, 2024

    Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, and Dawn Song. SecCodePLT: A unified benchmark for evaluating the security risks and capabilities of code agents, 2024. arXiv:2410.11096

  43. [44]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A benchmark for measuring harmfulness of LLM agents, 2024. arXiv:2410.09024; ICLR 2025

  44. [45]

    Ignore this title and HackAPrompt: Expos- ing systemic vulnerabilities of LLMs through a global scale prompt hacking competition, 2023

    Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagli- abue, Anson Liu Kost, Christopher Carnahan, and Jordan Boyd-Graber. Ignore this title and HackAPrompt: Expos- ing systemic vulnerabilities of LLMs through a global scale prompt hacking competition, 2023. arXiv:2311.16119

  45. [46]

    malicious code

    Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, and Kai Chen. Demystifying RCE vulnerabilities in LLM-integrated apps. InProceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024. arXiv:2309.02926. 26 Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review A Search Protocol Detail This appendix documents ...

  46. [47]

    RealSec-bench Wang 2026 LLM code security benchmark Java repository (Google)

  47. [48]

    USENIX Security CCS 2024–2026 LLM malicious code refusal benchmark dataset (Google)

  48. [49]

    NeurIPS Datasets Benchmarks track 2024–2025 LLM code safety malicious refusal evaluation (Google)

  49. [50]

    ACL EMNLP 2025 coding LLM safety benchmark refusal malware prompts (Google) 27 Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review A.3 Snowball Backward-Citation Protocol To guard against gaps in keyword-based retrieval, a backward-citation snowball was performed on every in-scope corpus paper. For each paper, the Related Work section a...