Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

Gregory D. Moody; Richard J. Young

arxiv: 2605.20351 · v1 · pith:XE5HXYL2new · submitted 2026-05-19 · 💻 cs.CR

Pith reviewed 2026-05-21 07:34 UTC · model grok-4.3

classification 💻 cs.CR

keywords: refusal evaluation · coding LLMs · malicious code prompts · prompt corpora · systematic review · jailbreak evaluation · malware taxonomy

The pith

A review of thirteen malicious-code prompt corpora for coding LLMs identifies three recurring methodological gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines thirteen publicly released collections of prompts built to test how coding large language models and agents refuse requests involving malicious code. It applies a uniform template to compare each corpus on construction methods, taxonomy of prompts, reproducibility, licensing, and coverage of malware types. The synthesis identifies three recurring gaps: missing human-annotator baselines to calibrate automated judgments, refusal-rate measures that track different constructs and cannot be compared directly, and incompatible malware-category schemes with no shared schema across the sets. A sympathetic reader would care because these gaps prevent clear comparisons of safety performance across models and slow progress on reliable refusal evaluation.

Core claim

By treating the prompt datasets themselves as the unit of analysis rather than code security or jailbreak taxonomy, the review shows that the thirteen corpora exhibit three recurring methodological gaps: the absence of human-annotator baselines against which LLM-judge labels can be calibrated, the absence of cross-corpus comparability with refusal-rate statistics measuring non-equivalent constructs, and the fragmentation of malware-category taxonomies with no canonical schema spanning the thirteen corpora.

What carries the argument

The uniform extraction template applied across all in-scope corpora to synthesize their construction methodology, prompt-construction taxonomy, reproducibility and licensing details, and malware-category coverage.
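
As a rough illustration, the template's four dimensions could be captured in a record like the one below. This is only a sketch: the class name, field names, helper function, and example values are assumptions made for illustration, not the authors' actual template and not data extracted from any real corpus.

```python
from dataclasses import dataclass, field, fields


@dataclass
class CorpusRecord:
    """One row of a hypothetical uniform extraction template."""
    name: str
    # Dimension 1: construction methodology
    construction_method: str
    # Dimension 2: prompt-construction taxonomy
    modality: str
    turn_structure: str
    elicitation_style: str
    # Dimension 3: reproducibility and licensing
    license: str
    artifacts_public: bool
    # Dimension 4: malware-category coverage
    malware_categories: list = field(default_factory=list)


def missing_fields(record):
    """Return the names of fields left empty, so gaps stay visible in the catalogue."""
    return [f.name for f in fields(record) if getattr(record, f.name) in ("", [], None)]


# Placeholder values only; not taken from any real corpus.
example = CorpusRecord(
    name="ExampleBench",
    construction_method="LLM-generated, manually filtered",
    modality="text-to-code",
    turn_structure="single-turn",
    elicitation_style="direct request",
    license="",
    artifacts_public=True,
    malware_categories=[],
)
print(missing_fields(example))  # ['license', 'malware_categories']
```

Filling the same record for every corpus is what makes cross-corpus comparison possible; empty fields then show up as explicit gaps rather than silent omissions.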

If this is right

Future corpora could adopt pre-registration of inclusion criteria to reduce fragmentation.
Validation could draw on multiple independent judges to strengthen label calibration.
Reliability reporting could standardize on statistical baselines with confidence intervals.
A shared canonical taxonomy for malware categories could support direct cross-corpus comparisons.
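
The reliability baseline named in the abstract, Fleiss' kappa with a bootstrap confidence interval, can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: every prompt is labelled by the same number of judges, the interval is a simple percentile bootstrap over prompts, and the toy labels are invented. The function names and the convention of returning 1.0 when only one category is ever used are choices made here, not something the paper specifies.

```python
import random
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels
    (one label per judge; every item must have the same number of judges)."""
    n_items = len(ratings)
    n_judges = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Proportion of agreeing judge pairs on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_judges) / (
            n_judges * (n_judges - 1)
        )
    p_bar = agreement_sum / n_items
    p_exp = sum((t / (n_items * n_judges)) ** 2 for t in totals.values())
    if p_exp == 1.0:
        return 1.0  # assumed convention: kappa is undefined when one category is used
    return (p_bar - p_exp) / (1.0 - p_exp)


def bootstrap_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling items (prompts) with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        fleiss_kappa([rng.choice(ratings) for _ in ratings]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Invented labels from three judges on six prompts.
toy = [
    ["refuse", "refuse", "refuse"],
    ["refuse", "refuse", "comply"],
    ["comply", "comply", "comply"],
    ["refuse", "comply", "comply"],
    ["refuse", "refuse", "refuse"],
    ["comply", "comply", "comply"],
]
print(round(fleiss_kappa(toy), 3))  # 0.556
print(bootstrap_ci(toy))
```

Resampling prompts rather than individual labels keeps each prompt's set of judge labels intact, which is the unit the agreement statistic is defined over.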

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Closing the gaps would let safety teams compare refusal rates across different coding models on equivalent terms.
A unified taxonomy might extend to refusal testing in non-code domains such as general text or other agent tasks.
Adopting the proposed directions could reduce the current scatter in reported refusal statistics.
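
To see why refusal rates built on different constructs cannot be compared directly, it helps to score one fixed set of responses under two definitions of "refusal". The sketch below is a toy: the `Response` fields, the two definitions, and the five responses are hypothetical and do not reproduce the metric of any corpus in the review.

```python
from dataclasses import dataclass


@dataclass
class Response:
    explicit_refusal: bool  # the model states that it will not help
    code_is_harmful: bool   # a judge deems the output usable for the malicious goal


def rate(responses, is_refusal):
    """Share of responses that count as a refusal under a given definition."""
    return sum(is_refusal(r) for r in responses) / len(responses)


def strict_refusal(r):
    """Definition A: only an explicit refusal statement counts."""
    return r.explicit_refusal


def no_harmful_output(r):
    """Definition B: any response that withholds harmful code counts."""
    return not r.code_is_harmful


responses = [
    Response(True, False),   # plain refusal
    Response(False, False),  # complies, but with a harmless stub
    Response(True, True),    # disclaimer followed by harmful code
    Response(False, True),   # full compliance
    Response(False, False),  # off-topic answer with no usable code
]
print(rate(responses, strict_refusal))     # 0.4
print(rate(responses, no_harmful_output))  # 0.6
```

The same model output yields 40% or 60% depending on the definition, so two corpora reporting those figures would not be disagreeing about the model at all.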

Load-bearing premise

The assumption that the search strategy and screening process comprehensively identified all relevant corpora released between 2023 and 2025 without significant omissions from publication bias or incomplete database coverage.

What would settle it

Locating one additional corpus released in the 2023-2025 window that the search missed, or finding that one of the thirteen already supplies calibrated human baselines together with a malware taxonomy shared with the others, would challenge the claim that these gaps recur across the full set.

Figures

Figures reproduced from arXiv: 2605.20351 by Gregory D. Moody, Richard J. Young.

**Figure 1.** Timeline of the thirteen in-scope malicious-code-generation prompt corpora plotted at their arXiv submission ...

**Figure 2.** Scope boundary map for the systematic review. The two axes are the evaluation construct (x: refusal vs. ...

**Figure 3.** Prompt-construction taxonomy across the thirteen in-scope corpora. The figure positions each corpus by ...

**Figure 4.** Coverage of malware categories across the thirteen in-scope corpora, in chronological release order (left to ...

**Figure 5.** Reproducibility and Reliability Gap Matrix for the thirteen in-scope malicious-code-generation prompt ...

**Figure 6.** PRISMA 2020 flow diagram for the systematic review of malicious-code-generation prompt corpora. Phase ...

Original abstract

The evaluation of large language model refusal on malicious-coding tasks now spans at least thirteen publicly released prompt corpora (AdvBench, the CyberSecEval family, RMCBench, RedCode, MCGMark, JailbreakBench, CySecBench, MalwareBench, CIRCLE, MOCHA, ASTRA, Scam2Prompt / Innoc2Scam-bench, and JAWS-Bench), each constructed under a different protocol, released under different licensing terms, and validated (or not) against different inter-rater reliability standards. Existing surveys treat code security, jailbreak taxonomy, or vulnerability detection as the central object and mention these corpora only in passing. This paper reverses that framing: it treats the prompt datasets themselves as the unit of analysis. Following a PRISMA-style protocol, we specify a search strategy, screen the recent literature on coding-LLM refusal evaluation, apply a uniform extraction template to each in-scope corpus, and synthesize the resulting catalogue along construction methodology, prompt-construction taxonomy (modality, turn structure, elicitation style), reproducibility and licensing, and malware-category coverage. The synthesis surfaces three recurring methodological gaps: the absence of human-annotator baselines against which LLM-judge labels can be calibrated, the absence of cross-corpus comparability with refusal-rate statistics measuring non-equivalent constructs, and the fragmentation of malware-category taxonomies, with no canonical schema spanning the thirteen in-scope corpora. The review concludes with proposed methodological directions for next-generation corpora, including pre-registration of inclusion criteria, vendor-diverse multi-judge validation, Fleiss' kappa with bootstrap CI as the reliability baseline, and a candidate canonical taxonomy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This review catalogs thirteen refusal corpora for coding LLMs and flags three recurring gaps, but the strength of those claims hinges on whether the search really captured the full set from 2023-2025.

This paper's main move is to treat the thirteen prompt corpora themselves as the primary object instead of scattering them across a broader survey on jailbreaks or code security. It applies one extraction template to each, covering construction methods, prompt taxonomy, reproducibility, licensing, and malware category coverage. That produces a compact catalogue and surfaces three gaps that show up across the set: missing human-annotator baselines for calibrating LLM judges, refusal-rate numbers that cannot be compared because the corpora test non-equivalent things, and no shared malware taxonomy that spans all thirteen.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a systematic review of thirteen publicly released prompt corpora (AdvBench, CyberSecEval family, RMCBench, RedCode, MCGMark, JailbreakBench, CySecBench, MalwareBench, CIRCLE, MOCHA, ASTRA, Scam2Prompt/Innoc2Scam-bench, JAWS-Bench) developed 2023-2025 for evaluating refusal on malicious coding tasks in LLMs and code agents. Following a PRISMA-style protocol with uniform extraction on construction methodology, prompt taxonomy (modality, turn structure, elicitation style), reproducibility/licensing, and malware-category coverage, the paper synthesizes recurring methodological gaps: absence of human-annotator baselines for calibrating LLM-judge labels, lack of cross-corpus comparability because refusal-rate statistics measure non-equivalent constructs, and fragmentation of malware taxonomies with no canonical schema spanning the corpora. It concludes with proposed directions including pre-registration of inclusion criteria, vendor-diverse multi-judge validation, Fleiss' kappa with bootstrap CI as reliability baseline, and a candidate canonical taxonomy.

Significance. If the corpus identification is comprehensive and the extracted gaps accurately characterize the literature, the review provides a useful reference catalogue that reverses the typical framing (treating the prompt datasets themselves, rather than code security, jailbreak taxonomy, or vulnerability detection, as the unit of analysis) and could reduce fragmentation in LLM safety evaluation for code. The uniform extraction template and explicit proposals for reliability metrics and taxonomy standardization are concrete strengths that support reproducibility and future corpus design.

major comments (1)

[Methods] Methods (PRISMA protocol description): the central claim that the three gaps recur across the full set of relevant corpora and that no canonical schema spans the literature depends on exhaustive identification of the 2023-2025 corpora. The manuscript states that a search strategy is specified and screening applied, but explicit query strings, list of databases, operational definition of 'coding-LLM refusal evaluation,' and the PRISMA flow diagram with numbers screened/excluded are required to quantify omission risk from publication bias or incomplete indexing.

minor comments (2)

[Abstract] Abstract and §4 (synthesis): the list of thirteen corpora would be clearer if presented in a summary table with columns for release year, licensing, and validation method.
[Conclusion] §5 (proposed directions): the candidate canonical taxonomy is referenced but not illustrated; an example schema or mapping to existing corpora would strengthen the recommendation.
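
The kind of mapping this comment asks for might look like the crosswalk sketched below, which translates corpus-local category labels into one shared schema and counts coverage per shared category. Everything here is hypothetical: the category names, the corpus identifiers, and the fallback to "other" are placeholders, not the paper's candidate taxonomy or the labels used by any of the thirteen corpora.

```python
# Hypothetical shared schema.
CANONICAL = {"ransomware", "keylogger", "network_attack", "other"}

# Hypothetical corpus-local labels; real corpora use their own schemes.
CROSSWALK = {
    ("corpus_a", "Ransomware"): "ransomware",
    ("corpus_a", "Spyware/Keylogging"): "keylogger",
    ("corpus_b", "file-encryption malware"): "ransomware",
    ("corpus_b", "DDoS"): "network_attack",
}


def to_canonical(corpus, label):
    """Map a corpus-local label to the shared schema; unknown labels become 'other'."""
    return CROSSWALK.get((corpus, label), "other")


def coverage(prompts):
    """Count prompts per canonical category for a list of (corpus, label) pairs."""
    counts = {category: 0 for category in sorted(CANONICAL)}
    for corpus, label in prompts:
        counts[to_canonical(corpus, label)] += 1
    return counts


sample = [
    ("corpus_a", "Ransomware"),
    ("corpus_b", "file-encryption malware"),
    ("corpus_b", "Rootkit"),  # no crosswalk entry, so it lands in 'other'
]
print(coverage(sample))
```

A large "other" bucket in such a table would itself be informative, since it flags categories the shared schema fails to cover.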

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our systematic review. The comment highlights an important area for improving methodological transparency, and we will revise the manuscript to address it directly.

Referee: [Methods] Methods (PRISMA protocol description): the central claim that the three gaps recur across the full set of relevant corpora and that no canonical schema spans the literature depends on exhaustive identification of the 2023-2025 corpora. The manuscript states that a search strategy is specified and screening applied, but explicit query strings, list of databases, operational definition of 'coding-LLM refusal evaluation,' and the PRISMA flow diagram with numbers screened/excluded are required to quantify omission risk from publication bias or incomplete indexing.

Authors: We agree that the current description of the search strategy and screening process is insufficiently detailed for full reproducibility and assessment of coverage. In the revised version we will add: (1) the exact query strings employed in each database, (2) the complete list of sources searched (arXiv, ACL Anthology, Hugging Face datasets, GitHub repositories, and selected conference proceedings), (3) an explicit operational definition of 'coding-LLM refusal evaluation' that was used to determine inclusion, and (4) a PRISMA flow diagram reporting the numbers of records identified, screened, excluded (with reasons), and finally included. These additions will allow readers to evaluate the risk of omitted corpora and will strengthen the foundation for the three recurring gaps we identify.

Revision: yes
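
The flow counts promised in item (4) lend themselves to a simple consistency check: records identified, minus duplicates, minus exclusions, should equal the number of corpora included. The sketch below assumes that simplified single-stage flow. Apart from the thirteen included corpora, every number and exclusion reason is an invented placeholder, and the function name is hypothetical.

```python
def check_prisma_flow(identified, duplicates_removed, excluded_by_reason, included):
    """Verify that a simplified PRISMA-style flow is internally consistent."""
    screened = identified - duplicates_removed
    excluded = sum(excluded_by_reason.values())
    if screened - excluded != included:
        raise ValueError(
            f"flow does not balance: {screened} screened - {excluded} excluded "
            f"!= {included} included"
        )
    return {
        "identified": identified,
        "screened": screened,
        "excluded": excluded,
        "included": included,
    }


# Placeholder counts; only the final figure of 13 comes from the review.
flow = check_prisma_flow(
    identified=120,
    duplicates_removed=20,
    excluded_by_reason={
        "not a prompt corpus": 60,
        "measures capability rather than refusal": 27,
    },
    included=13,
)
print(flow)
```

Reporting exclusions by reason, rather than as a single total, is what lets a reader judge whether a missing corpus was overlooked or deliberately ruled out.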

Circularity Check

0 steps flagged

No circularity: descriptive systematic review with no derivations or self-referential claims

This is a cataloging review that applies a PRISMA-style protocol to identify and extract features from thirteen existing corpora, then observes three recurring gaps in the published literature. No equations, fitted parameters, predictions, or first-principles derivations exist that could reduce to inputs by construction. Claims rest on direct synthesis of external sources rather than self-definition, self-citation chains, or renaming of known results. The central synthesis is therefore self-contained as an empirical survey.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the completeness of the literature search and the accuracy of feature extraction from the corpora descriptions; no free parameters or invented entities are introduced.

axioms (2)

[standard math] A PRISMA-style protocol provides a valid and reproducible framework for screening and synthesizing literature on prompt corpora.
The paper explicitly follows this established systematic review method as described in the abstract.
[domain assumption] The thirteen listed corpora (AdvBench, CyberSecEval family, etc.) constitute the relevant publicly released prompt sets for malicious-code refusal evaluation from 2023-2025.
The scope and synthesis depend on this coverage being representative.

pith-pipeline@v0.9.0 · 5840 in / 1643 out tokens · 61199 ms · 2026-05-21T07:34:28.636271+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Following a PRISMA-style systematic-review protocol, this paper specifies a search strategy, screens the recent literature on coding-LLM refusal evaluation, applies a uniform extraction template to each in-scope corpus, and synthesizes the resulting catalogue along four dimensions.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. Released datasets: harmful_strings (500 items) and harmful_behaviors (500 items in the original paper; the widely redistributed Hugging Face version cu...

[2] [2]

Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, et al. Purple Llama CyberSecEval: A secure coding benchmark for language models. arXiv preprint arXiv:2312.04724, 2023

[3] [3]

Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, and Joshua Saxe. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models. arXiv preprint arXiv:2404.13161, 2024

[4] [4]

Shengye Wan, Cyrus Nikolaidis, Daniel Song, David Molnar, James Crnkovich, Jayson Grace, Manish Bhatt, Sahana Chennabasappa, Spencer Whitman, Stephanie Ding, Vlad Ionescu, Yue Li, and Joshua Saxe. CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. arXiv preprint arXiv:2408.01605, 2024

[5] [5]

Jiachi Chen, Qingyuan Zhong, Yanlin Wang, Kaiwen Ning, Yongkun Liu, Zenan Xu, Zhe Zhao, Ting Chen, and Zibin Zheng. RMCBench: Benchmarking large language models’ resistance to malicious code. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE 2024), 2024

[6] [6]

Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, and Bo Li. RedCode: Risky code execution and generation benchmark for code agents. In Advances in Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track, 2024

[7] [8]

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models. In Advances in Neural Information Processing Systems 38 (NeurIPS 2...

[8] [9]

Johan Wahréus, Ahmed Mohamed Hussain, and Panos Papadimitratos. CySecBench: Generative AI-based cybersecurity-focused prompt dataset for benchmarking large language models. arXiv preprint arXiv:2501.01335, 2025

[9] [10]

Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, and Xuelong Li. LLMs caught in the crossfire: Malware requests and jailbreak challenges. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27833–27848...

[10] [11]

Gabriel Chua. Running in CIRCLE? A simple benchmark for LLM code interpreter security, 2025. arXiv:2507.19399

[11] [12]

Muntasir Wahed, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Nirav Diwan, Gang Wang, Dilek Hakkani-Tür, and Ismini Lourentzou. MOCHA: Are code language models robust against multi-turn malicious coding prompts? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 22922–22948,

[12] [13]

Winner Defender Team at Amazon Nova AI Challenge 2025. Test sets released under CC BY-NC 4.0 at huggingface.co/datasets/purpcode/mocha (gated); full training data available on request from the authors (mwahed2@illinois.edu)

[13] [14]

Xiangzhe Xu, Guangyu Shen, Zian Su, Siyuan Cheng, Hanxi Guo, Lu Yan, Xuan Chen, Jiasheng Jiang, Xiaolong Jin, Chengpeng Wang, Zhuo Zhang, and Xiangyu Zhang. ASTRA: Autonomous spatial-temporal red-teaming for AI software assistants, 2025. arXiv:2508.03936; released benchmark: PurCL/astra-agent-security (1,995 prompts)

[14] [15]

Zhiyang Chen, Tara Saba, Xun Deng, Xujie Si, and Fan Long. Scam2Prompt: A scalable framework for auditing malicious scam endpoints in production LLMs, 2025. arXiv:2509.02372; releases Innoc2Scam-bench, 1,559 innocuous developer prompts

[15] [16]

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, and Varun Kumar. Breaking the code: Security assessment of AI code agents through systematic jailbreaking attacks, 2025. arXiv:2510.01359; releases JAWS-Bench: 182 + 100 + 182 prompts/codebases across empty, single-file, and multi-file workspace regimes

[16] [17]

Enna Basic and Alberto Giaretta. From vulnerabilities to remediation: A systematic literature review of LLMs in code security, 2024. arXiv:2412.15004

[17] [18]

Safayat Bin Hakim, Kanchon Gharami, Nahid Farhady Ghalaty, and Shafika Showkat. Jailbreaking LLMs: A survey of attacks, defenses and evaluation, 2026. TechRxiv preprint

[18] [19]

Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. LLMs in software security: A survey of vulnerability detection techniques and insights, 2025. arXiv:2502.07049; preprint as of search cutoff

[19] [20]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology, 2025. arXiv:2406.00515

[20] [21]

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, and Md Rizwan Parvez. A survey on agentic security: Applications, threats and defenses, 2025. arXiv:2510.06445

[21] [22]

Richard J. Young and Gregory D. Moody. A validated prompt bank for malicious code generation: Separating executable weapons from security knowledge in 1,554 consensus-labeled prompts. arXiv preprint arXiv:2605.03179, 2026

[22] [23]

Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971

[23] [24]

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

[24] [25]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024

[25] [26]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024

[26] [27]

Jiexin Wang and Xitong Luo. Is your AI-generated code really safe? Evaluating large language models on secure code generation with CodeSecEval, 2024. arXiv:2407.02395

[27] [28]

Chihao Shen and Connor Dilgren. SecRepoBench: Benchmarking code agents for secure code completion in real-world repositories, 2025. arXiv:2504.21205

[28] [29]

Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, and Zibin Zheng. RealSec-bench: A benchmark for evaluating secure code generation in real-world repositories, 2026. arXiv:2601.22706

[29] [30]

Xiangzhe Xu and Zian Su. ProSec: Fortifying code LLMs with proactive security alignment, 2024. arXiv:2411.12882

[30] [31]

Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, and David Lo. SecureVibeBench: Benchmarking secure vibe coding of AI agents via reconstructing vulnerability-introducing scenarios, 2026. arXiv:2509.22097; originally released as SecureAgentBench, subsequently revised...

[31] [32]

Nathaniel Li and Alexander Pan. The WMDP benchmark: Measuring and reducing malicious use with unlearning,

[32] [33]

Yutao Mou, Shikun Zhang, and Wei Ye. SG-Bench: Evaluating LLM safety generalization across diverse tasks and prompt types. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. arXiv:2410.21965

[33] [34]

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2024. arXiv:2406.18495

[34] [35]

Lei Ba, Qinbin Li, and Songze Li. CIBER: A comprehensive benchmark for security evaluation of code interpreter agents, 2026. arXiv:2602.19547; out of scope (capability/agent-robustness rather than refusal)

[35] [36]

URL https://openreview.net/forum?id=VTF8yNQM66

Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang. SEC-bench: Automated benchmarking of LLM agents on real-world software security tasks, 2025. arXiv:2506.11791; out of scope (capability/security-engineering rather than refusal)

[36] [37]

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174, 1977

[37] [38]

Rajiv Movva, Pang Wei Koh, and Emma Pierson. Annotation alignment: Comparing LLM and human annotations of conversational safety. arXiv preprint arXiv:2406.06369, 2024

[38] [39]

Richard J. Young and Gregory D. Moody. A multi-corpus empirical re-evaluation of refusal-label reliability across twelve malicious-code prompt corpora. Manuscript in preparation. Companion empirical paper to the present systematic review. Applies the five-judge consensus protocol of [21] to all thirteen in-scope corpora and reports Fleiss’ kappa with boot...

[39] [40]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems, 2023. arXiv:2307.02483

[40] [41]

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2023. arXiv:2307.04657

[41] [42]

Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, and Christopher D. Manning. h4rm3l: A language for composable jailbreak attack synthesis, 2024. arXiv:2408.04811; ICLR 2025


[42] [43]

Yuzhou Nie, Zhun Wang, Yu Yang, Ruizhe Jiang, Yuheng Tang, Xander Davies, Yarin Gal, Bo Li, Wenbo Guo, and Dawn Song. SecCodePLT: A unified benchmark for evaluating the security risks and capabilities of code agents, 2024. arXiv:2410.11096

[43] [44]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, Eric Winsor, Jerome Wynne, Yarin Gal, and Xander Davies. AgentHarm: A benchmark for measuring harmfulness of LLM agents, 2024. arXiv:2410.09024; ICLR 2025

[44] [45]

Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, and Jordan Boyd-Graber. Ignore this title and HackAPrompt: Exposing systemic vulnerabilities of LLMs through a global scale prompt hacking competition, 2023. arXiv:2311.16119

[45] [46]

Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, and Kai Chen. Demystifying RCE vulnerabilities in LLM-integrated apps. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024. arXiv:2309.02926.

A Search Protocol Detail

This appendix documents ...

[46] [47]

RealSec-bench Wang 2026 LLM code security benchmark Java repository (Google)


[47] [48]

USENIX Security CCS 2024–2026 LLM malicious code refusal benchmark dataset (Google)


[48] [49]

NeurIPS Datasets Benchmarks track 2024–2025 LLM code safety malicious refusal evaluation (Google)


[49] [50]

ACL EMNLP 2025 coding LLM safety benchmark refusal malware prompts (Google)

A.3 Snowball Backward-Citation Protocol

To guard against gaps in keyword-based retrieval, a backward-citation snowball was performed on every in-scope corpus paper. For each paper, the Related Work section a...