Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
Pith reviewed 2026-05-15 03:17 UTC · model grok-4.3
The pith
A new taxonomy and matrix audit shows leading LLM attack benchmarks cover at most 25% of the STRIDE threat surface, with entire categories such as Service Disruption and Model Internals lacking any standardized tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying the 4×6 Target × Technique matrix to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates.
Load-bearing premise
That the 507-leaf taxonomy extracted from the 932 arXiv studies represents the full threat surface of inference-time LLM attacks comprehensively and without major omission.
Original abstract
We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023-2026). The matrix enables benchmark-external validation: auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a reusable 4×6 Target × Technique matrix grounded in STRIDE, derived from a 507-leaf taxonomy (401 data-populated + 106 threat-model-derived) extracted from 932 arXiv studies (2023-2026). It audits six public LLM attack benchmarks, finding that HarmBench, InjecAgent, and AgentDojo occupy non-overlapping cells covering ≤25% of the matrix, with entire STRIDE categories (Service Disruption, Model Internals) lacking standardized tests despite published attacks achieving 46× token amplification and 96% ASR. The work also documents naming fragmentation across 2,521 attack groups and releases the taxonomy, records, and mappings as extensible artifacts.
Significance. If the taxonomy construction is validated, the framework offers a concrete, benchmark-external method for tracking collective coverage of the LLM inference-time threat surface and could guide development of more complete evaluation suites. The release of artifacts and the quantitative contrast between benchmark gaps and high-success published attacks are practical strengths that support community adoption.
Major comments (3)
- [§3] §3 (Taxonomy Construction): The process for extracting and deduplicating the 507-leaf taxonomy from 932 papers—including rules for the 401 data-populated vs. 106 threat-model-derived leaves and the 2,521 attack groups—is presented without inter-rater agreement, coverage validation against held-out papers, or comparison to external taxonomies. This directly undermines the reliability of the 25% coverage claim.
- [§5.1] §5.1 (Benchmark Audit Results): The 25% coverage figure and the assertion of non-overlapping cells for HarmBench, InjecAgent, and AgentDojo are stated without an explicit cell-by-cell mapping table or counting method (cells vs. leaves), preventing verification of which STRIDE categories are actually populated and which remain empty.
- [§4] §4 (Attack Examples): The 46× token amplification and 96% ASR attacks cited as evidence for gaps in Service Disruption and Model Internals are referenced but not mapped to specific matrix cells or linked to the original papers, leaving the contrast with benchmark deficiencies unsupported.
Minor comments (2)
- [Abstract] Abstract: The date range '2023-2026' should specify the exact search cutoff date to allow reproducibility of the 932-paper corpus.
- [§2] Notation: The distinction between 'Target' and 'Technique' dimensions in the 4×6 matrix would benefit from an explicit definition table early in the paper.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
Referee: [§3] §3 (Taxonomy Construction): The process for extracting and deduplicating the 507-leaf taxonomy from 932 papers—including rules for the 401 data-populated vs. 106 threat-model-derived leaves and the 2,521 attack groups—is presented without inter-rater agreement, coverage validation against held-out papers, or comparison to external taxonomies. This directly undermines the reliability of the 25% coverage claim.
Authors: We agree that greater transparency on the construction process is warranted. The taxonomy was built via a systematic pipeline: keyword-based retrieval of the 932 papers, followed by manual extraction of attack instances to populate the 401 data-driven leaves and addition of 106 leaves to ensure full STRIDE coverage where empirical reports were absent. The 2,521 attack groups were deduplicated through semantic clustering of descriptions combined with author review to consolidate surface-form variants. While formal inter-rater agreement was not computed, the process relied on iterative internal consistency checks. We will add a dedicated appendix that documents the precise extraction rules, categorization examples, and deduplication criteria. A comparison to prior taxonomies will also be included in the related-work section. These changes will support the 25% coverage claim without requiring changes to the underlying data or matrix.
Revision: partial
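The consolidation of surface-form variants described in the response can be sketched as follows. This is a minimal illustrative stand-in, assuming a crude string-normalization key; the paper's actual pipeline uses semantic clustering plus author review, and the attack names below are hypothetical, not drawn from the corpus.

```python
import re
from collections import defaultdict

def normalize(name: str) -> str:
    """Collapse a surface form to a canonical key: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def group_surface_forms(names):
    """Group attack names whose normalized keys coincide.

    A crude stand-in for semantic clustering + manual review: string
    normalization catches punctuation/case variants but not paraphrases.
    """
    groups = defaultdict(list)
    for n in names:
        groups[normalize(n)].append(n)
    return dict(groups)

# Hypothetical surface forms of one attack (illustrative only).
forms = ["GCG", "gcg", "G.C.G.", "Greedy Coordinate Gradient"]
grouped = group_surface_forms(forms)
# The first three collapse to one key; the spelled-out name stays separate,
# which is exactly the residue that semantic clustering must resolve.
```

This illustrates why purely lexical deduplication undercounts merges and why the paper layers manual review on top.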
Referee: [§5.1] §5.1 (Benchmark Audit Results): The 25% coverage figure and the assertion of non-overlapping cells for HarmBench, InjecAgent, and AgentDojo are stated without an explicit cell-by-cell mapping table or counting method (cells vs. leaves), preventing verification of which STRIDE categories are actually populated and which remain empty.
Authors: We accept that an explicit mapping table is necessary for verification. Coverage is measured at the cell level of the 4×6 matrix (24 cells total): a cell counts as covered if any benchmark tests at least one leaf belonging to that Target–Technique pair. The 25% figure reflects the union of unique cells occupied by the three primary benchmarks. In the revised manuscript we will insert a supplementary table that enumerates, for each benchmark, the exact cells it populates, thereby demonstrating the non-overlapping distribution and confirming the empty STRIDE categories (Service Disruption, Model Internals). The counting procedure will be stated explicitly in §5.1.
Revision: yes
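The cell-level counting procedure the authors describe can be sketched directly. The benchmark-to-cell assignments below are illustrative placeholders, not the paper's actual mapping table; only the arithmetic (union of occupied cells over 24) follows the stated procedure.

```python
# Cell-level coverage on a 4x6 matrix (24 cells): a cell is covered if any
# benchmark tests at least one leaf mapped to that Target-Technique pair.
TOTAL_CELLS = 4 * 6

# Hypothetical cell assignments (row = target index, col = technique index).
benchmark_cells = {
    "HarmBench":  {(0, 0), (0, 1)},
    "InjecAgent": {(1, 2), (1, 3)},
    "AgentDojo":  {(2, 2), (2, 4)},
}

# Coverage is the union of occupied cells, so overlapping benchmarks
# would not be double-counted.
covered = set().union(*benchmark_cells.values())
coverage = len(covered) / TOTAL_CELLS
print(f"{len(covered)}/{TOTAL_CELLS} cells covered = {coverage:.0%}")  # 6/24 = 25%
```

With three benchmarks occupying two disjoint cells each, the union is 6 of 24 cells, reproducing the ≤25% figure; any overlap between benchmarks would only lower it.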
Referee: [§4] §4 (Attack Examples): The 46× token amplification and 96% ASR attacks cited as evidence for gaps in Service Disruption and Model Internals are referenced but not mapped to specific matrix cells or linked to the original papers, leaving the contrast with benchmark deficiencies unsupported.
Authors: We will revise the text to provide the missing mappings and citations. The 46× token-amplification attack will be explicitly assigned to the Service Disruption target and its corresponding technique cell, with direct references to the source papers. The 96% ASR attack will be placed in the Model Internals category with its original citation. A concise table will be added that links these high-success attacks to their matrix cells, thereby directly contrasting them with the uncovered cells in the audited benchmarks.
Revision: yes
Circularity Check
No circularity: taxonomy extraction and benchmark mapping are independent empirical steps
Full rationale
The paper extracts a 507-leaf taxonomy (401 data-populated + 106 threat-model-derived) from 932 external arXiv studies, grounds a 4×6 Target × Technique matrix in the standard STRIDE framework, and then maps six public benchmarks onto the resulting cells to compute coverage percentages. The 25% figure and the identification of missing STRIDE categories (Service Disruption, Model Internals) are direct empirical counts from this mapping; they do not reduce to any fitted parameter, self-citation chain, or definitional equivalence. No equations appear, no self-citations are load-bearing for the central claim, and the construction process relies on external literature rather than internal redefinition. The audit result therefore rests on external benchmarks and studies rather than on the paper's own constructions.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Matrix dimensions (4 targets × 6 techniques)
Axioms (1)
- Domain assumption: STRIDE threat categories apply without modification to inference-time LLM attacks
Invented entities (1)
- 4×6 Target × Technique matrix (no independent evidence)