Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

Alexandre Cristov\~ao Maiorano

arxiv: 2606.02822 · v1 · pith:A65U6NBFnew · submitted 2026-06-01 · 💻 cs.CR · cs.AI

Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

Alexandre Cristov\~ao Maiorano This is my paper

Pith reviewed 2026-06-28 13:47 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords LLM defensesOWASP LLM Top-10jailbreakrefusal filtersbudget controlsparaphrasing attacksthreat attributionBreach and Attack Simulation

0 comments

The pith

Refusal alone removes all jailbreak and prompt-leak findings while budget controls alone remove all sensitive-disclosure and unbounded-consumption findings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes which defense families in LLM applications close which OWASP LLM Top-10 threats by running a scanner against four synthetic endpoints that isolate refusal, budget controls, and their combination. Refusal eliminates jailbreaks and system-prompt leakage; budget controls eliminate data disclosure and unbounded use by cutting multi-step sequences; excessive agency requires the full stack including authentication. The same setup shows that Gemini-generated paraphrases reduce refusal block rates by 15-25 points on the affected categories while leaving budget performance unchanged. This attribution matters because production stacks combine these defenses yet prior benchmarks report only aggregate scores.

Core claim

Across N=10 replications on the lattice of endpoints, refusal alone removes all LLM01 and LLM07 findings; budget alone removes all LLM02 and LLM10 findings by terminating multi-step sequences; LLM06 requires the full stack. With 300 Gemini-generated paraphrases, L1 refusal block rate falls 15 pp on LLM01 and 25 pp on LLM07. Budget controls show no drop. A real Gemini-2.5-flash backend behind the L3 regex matches the stub endpoint exactly.

What carries the argument

Lattice of four synthetic endpoints (L0 none, L1 refusal-only, L2 budget-only, L3 full stack) plus 21-agent scanner with four OWASP-aware agents, measuring per-category finding counts and paraphrase brittleness.

If this is right

Refusal filters close jailbreaks and system-prompt leakage but can be weakened by paraphrasing without changing attack intent.
Budget controls close sensitive disclosure and unbounded consumption and remain unaffected by the same paraphrases.
Excessive agency threats are only fully addressed when refusal, budget, and tool-registry authentication are combined.
A static refusal whitelist that passes a benchmark can be defeated by an LLM-driven paraphraser.
Replacing the stub backend with a real model behind the same regex produces identical defense performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defense design may need to prioritize budget controls for certain threat classes because they resist paraphrasing mutations.
Testing pipelines could incorporate LLM-generated paraphrases as a standard brittleness check for refusal mechanisms.
The attribution approach could be extended to measure coverage of additional threat taxonomies beyond OWASP LLM Top-10.

Load-bearing premise

The synthetic endpoints and scanner produce coverage that generalizes to real production applications and the generated paraphrases preserve original attack intent without adding new vectors.

What would settle it

Running the identical scanner and paraphrases against a production LLM application and observing that refusal fails to remove all LLM01 or LLM07 findings or that budget controls lose effectiveness.

Figures

Figures reproduced from arXiv: 2606.02822 by Alexandre Cristov\~ao Maiorano.

**Figure 2.** Figure 2: The defense lattice. L1 and L2 are sibling single-axis ablations — each adds exactly one defense family to L0, with L1 ̸⊂ L2 and L2 ̸⊂ L1. L3 is the union of both families plus tool-registry authentication and credential scrubbing. The diamond shape (not a chain) is essential because a finding that disappears only at L3, but not at L1 or L2 individually, is attributable specifically to the controls L3 adds… view at source ↗

**Figure 3.** Figure 3: Attribution heatmap. Cells are coloured by finding count ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Aggregate block rate on 60 original probes [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet existing breach-and-attack-simulation (BAS) benchmarks report a single aggregate coverage number, hiding which family closes which threat. We measure attribution. We add four OWASP-LLM-Top-10-aware agents to a 21-agent baseline scanner and target a lattice of four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refusal-only), $L_2$ (budget-only), and $L_3$ (full stack). $L_1$ and $L_2$ are sibling single-axis ablations, not subsets of each other; $L_3$ is their union plus tool-registry authentication and credential scrubbing. Across $N=10$ replications, the per-OWASP finding count is clean: refusal alone removes all LLM01 (jailbreak) and LLM07 (system-prompt leakage) findings; budget alone removes all LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption) findings by terminating multi-step sequences; LLM06 (excessive agency) requires the full stack. We probe brittleness under paraphrasing: with 300 Gemini-generated paraphrases ($K=5$ over a 60-template brittleness corpus), $L_1$ refusal block rate falls 15 pp on LLM01 and 25 pp on LLM07. A fifth target, $L_4$-real, swaps the stub backend for Gemini-2.5-flash behind the same $L_3$ regex and matches $L_1$ exactly, indicating no measurable alignment contribution beyond the regex (not a general claim about alignment). Budget controls show no drop (0 pp once the rate-limit floor is factored out). A refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent; a budget control resists the same mutation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper cleanly maps which defense family blocks which OWASP threat via ablations, but the paraphrasing brittleness result lacks evidence that the rewrites preserve attack intent.

read the letter

The useful piece here is the per-threat attribution. Refusal alone clears the jailbreak and prompt-leakage categories, budget alone stops the disclosure and unbounded-consumption ones by cutting sequences short, and excessive agency needs the full combination. That split is more actionable than another aggregate BAS score.

The setup uses four synthetic endpoints (none, refusal-only, budget-only, full stack) plus a 21-agent scanner with four added OWASP agents. They run N=10 replications and then test 300 Gemini paraphrases from 60 templates. The real Gemini-2.5 backend check is a nice control and shows the regex is doing the work.

The attribution experiment is straightforward and the counts line up cleanly across categories. That part holds up as a measurement study.

The paraphrasing claim is the weaker link. The 15 pp and 25 pp drops on the refusal layer are reported as brittleness, yet nothing in the text checks whether the Gemini rewrites keep the original attack intent or strength. If the paraphrases are simply less effective prompts, the drop is an artifact of the mutation rather than defense fragility. No success-rate comparison on originals or semantic-equivalence review is mentioned.

Synthetic endpoints are acceptable for controlled ablation, but generalization to production apps is assumed rather than shown. No error bars or statistical tests accompany the finding counts.

This is for LLM security engineers who need to choose defenses by threat type rather than blanket coverage numbers. The attribution work is solid enough to merit referee time; the brittleness section would benefit from tighter validation on the paraphrase step.

Referee Report

2 major / 2 minor

Summary. The paper claims that a lattice of synthetic LLM endpoints (L0: no defenses; L1: refusal-only; L2: budget-only; L3: full stack) combined with an augmented 21-agent scanner (adding four OWASP-LLM-Top-10-aware agents) enables attribution of specific threats: refusal alone eliminates all LLM01 (jailbreak) and LLM07 (system-prompt leakage) findings; budget alone eliminates all LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption) findings; LLM06 (excessive agency) requires the full stack. Across N=10 replications the per-category counts are reported as clean. Under 300 Gemini-generated paraphrases (K=5 from a 60-template corpus), L1 refusal block rates drop 15 pp on LLM01 and 25 pp on LLM07 while budget shows 0 pp drop (after rate-limit adjustment). L4 (real Gemini-2.5-flash behind L3 regex) matches L1, indicating no additional alignment effect beyond the regex.

Significance. If the central attribution and brittleness results hold after addressing validation gaps, the work supplies a practical method for mapping defense families to OWASP categories, moving beyond aggregate BAS scores. The controlled ablation lattice (L1 and L2 as sibling single-axis designs) and explicit replication count are strengths that support clean isolation of effects. The paraphrase experiment, if intent preservation is verified, usefully demonstrates limits of static refusal lists versus budget controls.

major comments (2)

[Abstract / brittleness experiment] Abstract / brittleness experiment: the reported 15 pp and 25 pp drops in L1 refusal block rate under paraphrasing are attributed to defense fragility, but the manuscript provides no validation that the 300 Gemini paraphrases preserve original attack intent (e.g., no L0 success-rate comparison between originals and paraphrases, no human semantic-equivalence review, or exclusion criteria). Without this, the drops cannot be unambiguously assigned to brittleness rather than prompt weakening, directly undermining the claim that "a refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent."
[Abstract] Abstract: the per-OWASP finding counts are described as "clean" across N=10 replications with no accompanying error bars, statistical tests, or explicit exclusion rules for the synthetic endpoints. This makes it impossible to assess whether the reported complete removal of LLM01/LLM07 by refusal and LLM02/LLM10 by budget is robust to sampling variability, which is load-bearing for the attribution claims.

minor comments (2)

[Abstract] The description of L4 matching L1 exactly while using the L3 regex is stated without clarifying how the "L3 regex" interacts with the refusal-only behavior reported for L1; a short clarification of the exact configuration would improve readability.
The 60-template brittleness corpus and the 21-agent scanner baseline are referenced but lack even high-level selection criteria or agent composition details; adding one sentence on template sourcing would aid reproducibility without altering the central claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on attribution clarity and experimental validation. We address each major comment below and will revise the manuscript accordingly where gaps are identified.

read point-by-point responses

Referee: [Abstract / brittleness experiment] Abstract / brittleness experiment: the reported 15 pp and 25 pp drops in L1 refusal block rate under paraphrasing are attributed to defense fragility, but the manuscript provides no validation that the 300 Gemini paraphrases preserve original attack intent (e.g., no L0 success-rate comparison between originals and paraphrases, no human semantic-equivalence review, or exclusion criteria). Without this, the drops cannot be unambiguously assigned to brittleness rather than prompt weakening, directly undermining the claim that "a refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent."

Authors: We agree that explicit validation of intent preservation is required to isolate brittleness from prompt weakening. The paraphrases were generated via Gemini prompted to retain attack goal, target, and core semantics from the 60-template corpus, but the original manuscript omits L0 success-rate comparison on paraphrases and human review. We will add both: (1) L0 attack success rates on the 300 paraphrases (expected to remain high if intent is preserved) and (2) a small-scale human semantic-equivalence annotation on a 30-sample subset. This directly strengthens the brittleness claim. revision: yes
Referee: [Abstract] Abstract: the per-OWASP finding counts are described as "clean" across N=10 replications with no accompanying error bars, statistical tests, or explicit exclusion rules for the synthetic endpoints. This makes it impossible to assess whether the reported complete removal of LLM01/LLM07 by refusal and LLM02/LLM10 by budget is robust to sampling variability, which is load-bearing for the attribution claims.

Authors: The term "clean" denotes that the attributed categories showed exactly zero findings in every one of the N=10 independent replications for L1 and L2 (i.e., deterministic absence rather than statistical reduction). Because the outcome is a strict binary count with no observed variability across replications, conventional error bars or p-values add little information. However, to improve transparency we will (a) report the per-replication counts explicitly in a table, (b) state the exclusion rule used for synthetic endpoints (no findings retained if the endpoint returned an error or rate-limit response), and (c) note that the complete removal holds across all replications. This addresses the robustness concern without altering the attribution result. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurement study

full rationale

The paper performs controlled empirical measurements on four synthetic LLM endpoints (L0–L3) using a 21-agent scanner augmented with OWASP-aware agents. Reported attributions (refusal alone eliminates LLM01/LLM07 findings; budget alone eliminates LLM02/LLM10 findings) and paraphrase brittleness drops (15 pp on LLM01, 25 pp on LLM07) are direct observational counts from N=10 replications and 300 Gemini paraphrases. No equations, fitted parameters, self-citations, or derivations are present that would reduce any result to an input by construction. The L4-real comparison and budget floor adjustment are likewise independent measurements. This matches the default expectation for an empirical attribution study with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The measurement implicitly assumes the scanner agents and synthetic endpoints faithfully represent real threats and that paraphrases do not alter intent.

axioms (2)

domain assumption The 21-agent baseline scanner plus four OWASP-aware agents produce representative threat coverage for the OWASP LLM Top-10.
Invoked by the choice of scanner and the claim that findings are clean across replications.
domain assumption Gemini-generated paraphrases preserve attack intent while only changing surface form.
Required for the brittleness claim that refusal drops while budget does not.

pith-pipeline@v0.9.1-grok · 5911 in / 1368 out tokens · 29583 ms · 2026-06-28T13:47:58.786888+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Attack Simulation to SIEM Rule: Deterministic Detection-as-Code Synthesis with Probe-Level Traceability
cs.CR 2026-06 conditional novelty 6.0

A deterministic synthesis function maps findings from locked BAS probe corpora to starter Sigma rules via a 23-template library, with full parseability to Splunk/Elasticsearch and probe-level traceability.

Reference graph

Works this paper leans on

32 extracted references · 11 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

Attackiq continuous threat expo- sure management platform, 2025

AttackIQ. Attackiq continuous threat expo- sure management platform, 2025. URL https: //attackiq.com/platform/. Commercial BAS reference — pricing, methodology, MITRE align- ment

2025
[2]

org/abs/2012.07805

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ul- far Erlingsson, Alina Oprea, and Colin Raffel. Ex- tracting training data from large language mod- els.USENIX Security Symposium, 2021. URL https://arxiv.org/abs/2012.07805. Founda- tional paper on memorization and ...

work page arXiv 2021
[3]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexan- der Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tram` er, Hamed Hassani, and Eric Wong. Jailbreak- Bench: An open robustness benchmark for jail- breaking large language models, 2024. URL https://arxiv.org/abs/2404.01318 . JBB- Behavio...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https: //arxiv.org/abs/2310.08419. PAIR: Prompt Automatic Iterative Refinement — LLM-driven paraphrasing attack with seeded RNG

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Mitre att&ck v19 unpacked: What changed and how to operationalize, 2026

Cymulate. Mitre att&ck v19 unpacked: What changed and how to operationalize, 2026. URL https://cymulate.com/blog/mitre-attac k-v19-breakdown/ . Industry interpretation of detection coverage scoring and operationalizing ATT&CK framework updates

2026
[6]

Derczynski, E

Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. garak: A framework for security probing large language models, 2024. URL https://arxiv.org/abs/ 2406.11036. NVIDIA garak open-source LLM vulnerability scanner with∼140 probes

work page arXiv 2024
[7]

L1b3rt4s — curated jailbreak prompt corpus, 2024

Elder-Plinius. L1b3rt4s — curated jailbreak prompt corpus, 2024. URL https://github .com/elder- plinius/L1B3RT4S . Continu- ously updated jailbreak prompts (community- maintained)

2024
[8]

Mitigating the OWASP top 10 for large language models applications using intelligent agents, 2026

Mohammad Fasha, Faisal Abul Rub, Nasim Matar, Bilal Sowan, and Mohammad Al Khaldy. Mitigating the OWASP top 10 for large language models applications using intelligent agents, 2026. URL https://arxiv.org/abs/2601.18105 . Agent-based mitigation strategies for the OWASP LLM Top 10

work page arXiv 2026
[9]

Implement a continuous threat exposure management program, 2023

Gartner. Implement a continuous threat exposure management program, 2023. CTEM five-stage framework: scoping, discovery, prioritization, val- idation, mobilization

2023
[10]

Market guide for breach and attack simulation, 2024

Gartner. Market guide for breach and attack simulation, 2024. Industry definition of BAS tooling and vendor landscape

2024
[11]

Jailbreaking large language models: A practical guide, 2024

Lakera AI. Jailbreaking large language models: A practical guide, 2024. URL https://www.la kera.ai/blog/jailbreaking-large-languag e-models-guide. Industry practitioner overview of jailbreak families

2024
[12]

ACE: A Security Architecture for LLM-Integrated App Systems

Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, and Cristina Nita- Rotaru. ACE: A security architecture for LLM- integrated app systems, 2025. URL https: //arxiv.org/abs/2504.20984 . Defense-in- depth architecture for LLM-integrated applica- tions; accepted at NDSS 2026. 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

HarmBench: A standardized evaluation framework for auto- mated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for auto- mated red teaming and robust refusal. InICML,
[14]

URL https://arxiv.org/abs/2402.042
[15]

510-behavior standardized red-team evalua- tion; 50-prompt held-out subset used here as a brittleness comparison anchor
[16]

arXiv preprint arXiv:2312.02119 , year =

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of at- tacks: Jailbreaking black-box LLMs automati- cally.NeurIPS, 2024. URL https://arxiv.org/ abs/2312.02119. TAP: Tree-search over LLM- generated paraphrases. Source of probe-mutation variance in the LLM jailbreak literature

work page arXiv 2024
[17]

Mitre att&ck knowledge base, v15, 2024

MITRE Corporation. Mitre att&ck knowledge base, v15, 2024. URL https://attack.mitre .org/. Canonical taxonomy of adversary tactics and techniques

2024
[18]

Mitre att&ck evaluations — artifacts and methodology, 2024

MITRE Engenuity / Center for Threat-Informed Defense. Mitre att&ck evaluations — artifacts and methodology, 2024. URL https://gith ub.com/mitre- engenuity/ctid- attack- e valuations . Reference for transparent BAS evaluation methodology. Original domain (mitre- engenuity.org) retired 2025; artifacts preserved at this repository

2024
[19]

Owasp top 10 for large language model applications, 2025 edition, 2025

OWASP GenAI Security Project. Owasp top 10 for large language model applications, 2025 edition, 2025. URL https://genai.owasp.or g/llm-top-10/. Canonical taxonomy for LLM application risk. Categories LLM01-LLM10

2025
[20]

Pentera automated security validation,

Pentera. Pentera automated security validation,
[21]

Commercial AEV — autonomous pentesting

URL https://pentera.io/platform/ . Commercial AEV — autonomous pentesting
[22]

Picus security cybersecurity glos- sary, 2025

Picus Security. Picus security cybersecurity glos- sary, 2025. URL https://www.picussecurity. com/resource/glossary/ . Industry terminol- ogy reference including BAS, CTEM, and threat coverage definitions

2025
[23]

Atomic red team: Library of mitre att&ck aligned tests, 2024

Red Canary. Atomic red team: Library of mitre att&ck aligned tests, 2024. URL https://gi thub.com/redcanaryco/atomic- red- team . Open-source library of small, portable detection tests mapped to ATT&CK techniques

2024
[24]

Safebreach validate — breach and attack simulation, 2025

SafeBreach. Safebreach validate — breach and attack simulation, 2025. URL https://www. safebreach.com/validate-breach-and-att ack-simulation/ . Commercial BAS — APT emulation focus

2025
[25]

Is the OWASP top 10 list comprehen- sive enough for writing secure code?, 2020

Parth Sane. Is the OWASP top 10 list comprehen- sive enough for writing secure code?, 2020. URL https://arxiv.org/abs/2002.11269 . Dis- cusses the gap between OWASP Top 10 coverage and the long tail of real vulnerabilities

work page arXiv 2020
[26]

Countermind: A multi-layered security architecture for large language models,

Dominik Schwarz. Countermind: A multi-layered security architecture for large language models,
[27]

Multi-layered defense stack for LLMs; closest concurrent work to our cumulative-defense ladder

URL https://arxiv.org/abs/2510 .11837. Multi-layered defense stack for LLMs; closest concurrent work to our cumulative-defense ladder
[28]

Benchmark- ing LLAMA model security against OWASP top 10 for LLM applications, 2026

Nourin Shahin and Izzat Alsmadi. Benchmark- ing LLAMA model security against OWASP top 10 for LLM applications, 2026. URL https: //arxiv.org/abs/2601.19970 . Concurrent OWASP-LLM-Top-10 benchmark focused on a single model family

work page arXiv 2026
[29]

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Stein- hardt. Jailbroken: How does llm safety train- ing fail?NeurIPS, 2023. URL https://arxi v.org/abs/2307.02483 . Catalogues jailbreak failure modes including competing-objective and mismatched-generalization

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Xm cyber attack path management,

XM Cyber. Xm cyber attack path management,
[31]

Reference for multi-step attack chain visualization

URL https://www.xmcyber.com/plat form/ . Reference for multi-step attack chain visualization
[32]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Mi- lad Nasr, J. Zico Kolter, and Matt Fredrik- son. Universal and transferable adversarial at- tacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043 . Greedy Coordinate Gradient attack — universal jailbreak suffix generation. A Replication artifacts This appendix lists the artifacts that trav...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Attackiq continuous threat expo- sure management platform, 2025

AttackIQ. Attackiq continuous threat expo- sure management platform, 2025. URL https: //attackiq.com/platform/. Commercial BAS reference — pricing, methodology, MITRE align- ment

2025

[2] [2]

org/abs/2012.07805

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ul- far Erlingsson, Alina Oprea, and Colin Raffel. Ex- tracting training data from large language mod- els.USENIX Security Symposium, 2021. URL https://arxiv.org/abs/2012.07805. Founda- tional paper on memorization and ...

work page arXiv 2021

[3] [3]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexan- der Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tram` er, Hamed Hassani, and Eric Wong. Jailbreak- Bench: An open robustness benchmark for jail- breaking large language models, 2024. URL https://arxiv.org/abs/2404.01318 . JBB- Behavio...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Jailbreaking Black Box Large Language Models in Twenty Queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries, 2024. URL https: //arxiv.org/abs/2310.08419. PAIR: Prompt Automatic Iterative Refinement — LLM-driven paraphrasing attack with seeded RNG

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Mitre att&ck v19 unpacked: What changed and how to operationalize, 2026

Cymulate. Mitre att&ck v19 unpacked: What changed and how to operationalize, 2026. URL https://cymulate.com/blog/mitre-attac k-v19-breakdown/ . Industry interpretation of detection coverage scoring and operationalizing ATT&CK framework updates

2026

[6] [6]

Derczynski, E

Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. garak: A framework for security probing large language models, 2024. URL https://arxiv.org/abs/ 2406.11036. NVIDIA garak open-source LLM vulnerability scanner with∼140 probes

work page arXiv 2024

[7] [7]

L1b3rt4s — curated jailbreak prompt corpus, 2024

Elder-Plinius. L1b3rt4s — curated jailbreak prompt corpus, 2024. URL https://github .com/elder- plinius/L1B3RT4S . Continu- ously updated jailbreak prompts (community- maintained)

2024

[8] [8]

Mitigating the OWASP top 10 for large language models applications using intelligent agents, 2026

Mohammad Fasha, Faisal Abul Rub, Nasim Matar, Bilal Sowan, and Mohammad Al Khaldy. Mitigating the OWASP top 10 for large language models applications using intelligent agents, 2026. URL https://arxiv.org/abs/2601.18105 . Agent-based mitigation strategies for the OWASP LLM Top 10

work page arXiv 2026

[9] [9]

Implement a continuous threat exposure management program, 2023

Gartner. Implement a continuous threat exposure management program, 2023. CTEM five-stage framework: scoping, discovery, prioritization, val- idation, mobilization

2023

[10] [10]

Market guide for breach and attack simulation, 2024

Gartner. Market guide for breach and attack simulation, 2024. Industry definition of BAS tooling and vendor landscape

2024

[11] [11]

Jailbreaking large language models: A practical guide, 2024

Lakera AI. Jailbreaking large language models: A practical guide, 2024. URL https://www.la kera.ai/blog/jailbreaking-large-languag e-models-guide. Industry practitioner overview of jailbreak families

2024

[12] [12]

ACE: A Security Architecture for LLM-Integrated App Systems

Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, and Cristina Nita- Rotaru. ACE: A security architecture for LLM- integrated app systems, 2025. URL https: //arxiv.org/abs/2504.20984 . Defense-in- depth architecture for LLM-integrated applica- tions; accepted at NDSS 2026. 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

HarmBench: A standardized evaluation framework for auto- mated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for auto- mated red teaming and robust refusal. InICML,

[14] [14]

URL https://arxiv.org/abs/2402.042

[15] [15]

510-behavior standardized red-team evalua- tion; 50-prompt held-out subset used here as a brittleness comparison anchor

[16] [16]

arXiv preprint arXiv:2312.02119 , year =

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of at- tacks: Jailbreaking black-box LLMs automati- cally.NeurIPS, 2024. URL https://arxiv.org/ abs/2312.02119. TAP: Tree-search over LLM- generated paraphrases. Source of probe-mutation variance in the LLM jailbreak literature

work page arXiv 2024

[17] [17]

Mitre att&ck knowledge base, v15, 2024

MITRE Corporation. Mitre att&ck knowledge base, v15, 2024. URL https://attack.mitre .org/. Canonical taxonomy of adversary tactics and techniques

2024

[18] [18]

Mitre att&ck evaluations — artifacts and methodology, 2024

MITRE Engenuity / Center for Threat-Informed Defense. Mitre att&ck evaluations — artifacts and methodology, 2024. URL https://gith ub.com/mitre- engenuity/ctid- attack- e valuations . Reference for transparent BAS evaluation methodology. Original domain (mitre- engenuity.org) retired 2025; artifacts preserved at this repository

2024

[19] [19]

Owasp top 10 for large language model applications, 2025 edition, 2025

OWASP GenAI Security Project. Owasp top 10 for large language model applications, 2025 edition, 2025. URL https://genai.owasp.or g/llm-top-10/. Canonical taxonomy for LLM application risk. Categories LLM01-LLM10

2025

[20] [20]

Pentera automated security validation,

Pentera. Pentera automated security validation,

[21] [21]

Commercial AEV — autonomous pentesting

URL https://pentera.io/platform/ . Commercial AEV — autonomous pentesting

[22] [22]

Picus security cybersecurity glos- sary, 2025

Picus Security. Picus security cybersecurity glos- sary, 2025. URL https://www.picussecurity. com/resource/glossary/ . Industry terminol- ogy reference including BAS, CTEM, and threat coverage definitions

2025

[23] [23]

Atomic red team: Library of mitre att&ck aligned tests, 2024

Red Canary. Atomic red team: Library of mitre att&ck aligned tests, 2024. URL https://gi thub.com/redcanaryco/atomic- red- team . Open-source library of small, portable detection tests mapped to ATT&CK techniques

2024

[24] [24]

Safebreach validate — breach and attack simulation, 2025

SafeBreach. Safebreach validate — breach and attack simulation, 2025. URL https://www. safebreach.com/validate-breach-and-att ack-simulation/ . Commercial BAS — APT emulation focus

2025

[25] [25]

Is the OWASP top 10 list comprehen- sive enough for writing secure code?, 2020

Parth Sane. Is the OWASP top 10 list comprehen- sive enough for writing secure code?, 2020. URL https://arxiv.org/abs/2002.11269 . Dis- cusses the gap between OWASP Top 10 coverage and the long tail of real vulnerabilities

work page arXiv 2020

[26] [26]

Countermind: A multi-layered security architecture for large language models,

Dominik Schwarz. Countermind: A multi-layered security architecture for large language models,

[27] [27]

Multi-layered defense stack for LLMs; closest concurrent work to our cumulative-defense ladder

URL https://arxiv.org/abs/2510 .11837. Multi-layered defense stack for LLMs; closest concurrent work to our cumulative-defense ladder

[28] [28]

Benchmark- ing LLAMA model security against OWASP top 10 for LLM applications, 2026

Nourin Shahin and Izzat Alsmadi. Benchmark- ing LLAMA model security against OWASP top 10 for LLM applications, 2026. URL https: //arxiv.org/abs/2601.19970 . Concurrent OWASP-LLM-Top-10 benchmark focused on a single model family

work page arXiv 2026

[29] [29]

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Stein- hardt. Jailbroken: How does llm safety train- ing fail?NeurIPS, 2023. URL https://arxi v.org/abs/2307.02483 . Catalogues jailbreak failure modes including competing-objective and mismatched-generalization

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Xm cyber attack path management,

XM Cyber. Xm cyber attack path management,

[31] [31]

Reference for multi-step attack chain visualization

URL https://www.xmcyber.com/plat form/ . Reference for multi-step attack chain visualization

[32] [32]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Mi- lad Nasr, J. Zico Kolter, and Matt Fredrik- son. Universal and transferable adversarial at- tacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043 . Greedy Coordinate Gradient attack — universal jailbreak suffix generation. A Replication artifacts This appendix lists the artifacts that trav...

work page internal anchor Pith review Pith/arXiv arXiv 2023