Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
Pith reviewed 2026-05-15 03:17 UTC · model grok-4.3
The pith
A new taxonomy and matrix audit shows leading LLM attack benchmarks cover at most 25% of the STRIDE threat surface, with entire categories such as Service Disruption and Model Internals lacking any standardized tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying the 4×6 Target × Technique matrix to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates.
Load-bearing premise
That the 507-leaf taxonomy extracted from the 932 arXiv studies represents the full threat surface of inference-time LLM attacks comprehensively and without major omission.
Original abstract
We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4×6 Target × Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy (401 data-populated and 106 threat-model-derived leaves) of inference-time attacks extracted from 932 arXiv security studies (2023-2026). The matrix enables benchmark-external validation: auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46× token amplification and 96% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a reusable 4×6 Target × Technique matrix grounded in STRIDE, derived from a 507-leaf taxonomy (401 data-populated + 106 threat-model-derived) extracted from 932 arXiv studies (2023-2026). It audits six public LLM attack benchmarks, finding that HarmBench, InjecAgent, and AgentDojo occupy non-overlapping cells covering ≤25% of the matrix, with entire STRIDE categories (Service Disruption, Model Internals) lacking standardized tests despite published attacks achieving 46× token amplification and 96% ASR. The work also documents naming fragmentation across 2,521 attack groups and releases the taxonomy, records, and mappings as extensible artifacts.
Significance. If the taxonomy construction is validated, the framework offers a concrete, benchmark-external method for tracking collective coverage of the LLM inference-time threat surface and could guide development of more complete evaluation suites. The release of artifacts and the quantitative contrast between benchmark gaps and high-success published attacks are practical strengths that support community adoption.
Major comments (3)
- [§3] §3 (Taxonomy Construction): The process for extracting and deduplicating the 507-leaf taxonomy from 932 papers—including rules for the 401 data-populated vs. 106 threat-model-derived leaves and the 2,521 attack groups—is presented without inter-rater agreement, coverage validation against held-out papers, or comparison to external taxonomies. This directly undermines the reliability of the 25% coverage claim.
- [§5.1] §5.1 (Benchmark Audit Results): The 25% coverage figure and the assertion of non-overlapping cells for HarmBench, InjecAgent, and AgentDojo are stated without an explicit cell-by-cell mapping table or counting method (cells vs. leaves), preventing verification of which STRIDE categories are actually populated and which remain empty.
- [§4] §4 (Attack Examples): The 46× token amplification and 96% ASR attacks cited as evidence for gaps in Service Disruption and Model Internals are referenced but not mapped to specific matrix cells or linked to the original papers, leaving the contrast with benchmark deficiencies unsupported.
Minor comments (2)
- [Abstract] Abstract: The date range '2023-2026' should specify the exact search cutoff date to allow reproducibility of the 932-paper corpus.
- [§2] Notation: The distinction between 'Target' and 'Technique' dimensions in the 4×6 matrix would benefit from an explicit definition table early in the paper.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
Referee: [§3] §3 (Taxonomy Construction): The process for extracting and deduplicating the 507-leaf taxonomy from 932 papers—including rules for the 401 data-populated vs. 106 threat-model-derived leaves and the 2,521 attack groups—is presented without inter-rater agreement, coverage validation against held-out papers, or comparison to external taxonomies. This directly undermines the reliability of the 25% coverage claim.
Authors: We agree that greater transparency on the construction process is warranted. The taxonomy was built via a systematic pipeline: keyword-based retrieval of the 932 papers, followed by manual extraction of attack instances to populate the 401 data-driven leaves and addition of 106 leaves to ensure full STRIDE coverage where empirical reports were absent. The 2,521 attack groups were deduplicated through semantic clustering of descriptions combined with author review to consolidate surface-form variants. While formal inter-rater agreement was not computed, the process relied on iterative internal consistency checks. We will add a dedicated appendix that documents the precise extraction rules, categorization examples, and deduplication criteria. A comparison to prior taxonomies will also be included in the related-work section. These changes will support the 25% coverage claim without requiring changes to the underlying data or matrix.
Revision: partial
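The consolidation of surface-form variants described in the response can be sketched as follows. This is a minimal illustrative stand-in, assuming a crude string-normalization key; the paper's actual pipeline uses semantic clustering plus author review, and the attack names below are hypothetical, not drawn from the corpus.

```python
import re
from collections import defaultdict

def normalize(name: str) -> str:
    """Collapse a surface form to a canonical key: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def group_surface_forms(names):
    """Group attack names whose normalized keys coincide.

    A crude stand-in for semantic clustering + manual review: string
    normalization catches punctuation/case variants but not paraphrases.
    """
    groups = defaultdict(list)
    for n in names:
        groups[normalize(n)].append(n)
    return dict(groups)

# Hypothetical surface forms of one attack (illustrative only).
forms = ["GCG", "gcg", "G.C.G.", "Greedy Coordinate Gradient"]
grouped = group_surface_forms(forms)
# The first three collapse to one key; the spelled-out name stays separate,
# which is exactly the residue that semantic clustering must resolve.
```

This illustrates why purely lexical deduplication undercounts merges and why the paper layers manual review on top.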
Referee: [§5.1] §5.1 (Benchmark Audit Results): The 25% coverage figure and the assertion of non-overlapping cells for HarmBench, InjecAgent, and AgentDojo are stated without an explicit cell-by-cell mapping table or counting method (cells vs. leaves), preventing verification of which STRIDE categories are actually populated and which remain empty.
Authors: We accept that an explicit mapping table is necessary for verification. Coverage is measured at the cell level of the 4×6 matrix (24 cells total): a cell counts as covered if any benchmark tests at least one leaf belonging to that Target–Technique pair. The 25% figure reflects the union of unique cells occupied by the three primary benchmarks. In the revised manuscript we will insert a supplementary table that enumerates, for each benchmark, the exact cells it populates, thereby demonstrating the non-overlapping distribution and confirming the empty STRIDE categories (Service Disruption, Model Internals). The counting procedure will be stated explicitly in §5.1.
Revision: yes
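The cell-level counting procedure the authors describe can be sketched directly. The benchmark-to-cell assignments below are illustrative placeholders, not the paper's actual mapping table; only the arithmetic (union of occupied cells over 24) follows the stated procedure.

```python
# Cell-level coverage on a 4x6 matrix (24 cells): a cell is covered if any
# benchmark tests at least one leaf mapped to that Target-Technique pair.
TOTAL_CELLS = 4 * 6

# Hypothetical cell assignments (row = target index, col = technique index).
benchmark_cells = {
    "HarmBench":  {(0, 0), (0, 1)},
    "InjecAgent": {(1, 2), (1, 3)},
    "AgentDojo":  {(2, 2), (2, 4)},
}

# Coverage is the union of occupied cells, so overlapping benchmarks
# would not be double-counted.
covered = set().union(*benchmark_cells.values())
coverage = len(covered) / TOTAL_CELLS
print(f"{len(covered)}/{TOTAL_CELLS} cells covered = {coverage:.0%}")  # 6/24 = 25%
```

With three benchmarks occupying two disjoint cells each, the union is 6 of 24 cells, reproducing the ≤25% figure; any overlap between benchmarks would only lower it.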
Referee: [§4] §4 (Attack Examples): The 46× token amplification and 96% ASR attacks cited as evidence for gaps in Service Disruption and Model Internals are referenced but not mapped to specific matrix cells or linked to the original papers, leaving the contrast with benchmark deficiencies unsupported.
Authors: We will revise the text to provide the missing mappings and citations. The 46× token-amplification attack will be explicitly assigned to the Service Disruption target and its corresponding technique cell, with direct references to the source papers. The 96% ASR attack will be placed in the Model Internals category with its original citation. A concise table will be added that links these high-success attacks to their matrix cells, thereby directly contrasting them with the uncovered cells in the audited benchmarks.
Revision: yes
Circularity Check
No circularity: taxonomy extraction and benchmark mapping are independent empirical steps
Full rationale
The paper extracts a 507-leaf taxonomy (401 data-populated + 106 threat-model-derived) from 932 external arXiv studies, grounds a 4×6 Target × Technique matrix in the standard STRIDE framework, and then maps six public benchmarks onto the resulting cells to compute coverage percentages. The 25% figure and the identification of missing STRIDE categories (Service Disruption, Model Internals) are direct empirical counts from this mapping; they do not reduce to any fitted parameter, self-citation chain, or definitional equivalence. No equations appear, no self-citations are load-bearing for the central claim, and the construction process relies on external literature rather than internal redefinition. The audit result therefore rests on external benchmarks and studies rather than on the paper's own constructions.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Matrix dimensions (4 targets × 6 techniques)
Axioms (1)
- Domain assumption: STRIDE threat categories apply without modification to inference-time LLM attacks
Invented entities (1)
- 4×6 Target × Technique matrix (no independent evidence)