pith. sign in

arxiv: 2606.02959 · v1 · pith:EEQNJJ24new · submitted 2026-06-01 · 💻 cs.LG · cs.CR

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Pith reviewed 2026-06-28 15:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords LLM securityprompt injection detectionjailbreak detectionevaluation methodologycross-validationoperating pointfalse positive ratebenchmarking
0
0 comments X

The pith

Evaluation harness for LLM detectors selects one global operating point and applies it uniformly across 16 benchmarks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Published evaluations of prompt-injection and jailbreak detectors for large language models often tune thresholds separately on each dataset and leave the chosen operating points undisclosed. The paper introduces an evaluation harness that runs 5-fold cross-validation over a pooled collection of 16 public benchmarks containing 12,111 samples. On the held-out folds it selects a single global threshold that maximizes F1 subject to a false-positive rate of at most 1 percent and then applies that same threshold to every dataset. Additional diagnostic passes, including a group-fold leakage check that clusters near-duplicates, test whether the chosen point generalizes rather than overfits to individual benchmarks.

Core claim

The harness scores any detector across the 16 benchmarks with 5-fold cross-validation, selects one global operating point on the held-out folds by maximizing F1 while constraining FPR to ≤1 percent, and applies that operating point uniformly so that per-dataset scores reflect a consistent threshold rather than per-benchmark optimization; a parallel group-fold pass over prompt-ID and MinHash clusters provides a leakage diagnostic.

What carries the argument

The 5-fold cross-validation procedure that selects a single global max-F1 operating point at FPR ≤1 percent on held-out folds and applies it uniformly, with a parallel StratifiedGroupKFold leakage diagnostic over composite near-duplicate keys.

If this is right

  • Per-dataset results now reflect performance under one fixed operating point instead of benchmark-specific tuning.
  • Head-to-head comparisons with external detectors are performed after re-tuning the harness threshold to match each competitor's published false-positive rate.
  • A battery of diagnostics (leave-one-dataset-out, random-label control, length-bias correlation, cross-source duplicate detection) quantifies generalization beyond the main cross-validation.
  • The group-fold leakage diagnostic runs in parallel with the row-stratified pass to surface hidden prompt overlap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of the harness would make it harder for published detector papers to report inflated numbers obtained through hidden per-dataset tuning.
  • The approach could be extended by adding private or adversarially generated benchmarks to test whether the global threshold still holds outside the current public set.
  • If the near-duplicate clustering at Jaccard 0.8 misses semantically equivalent prompts that differ in wording, the leakage diagnostic may understate the risk of train-test contamination.

Load-bearing premise

The 16 public benchmarks together with the chosen near-duplicate clustering supply a representative and leakage-controlled distribution on which one global threshold remains meaningful.

What would settle it

If the globally selected threshold produces markedly lower F1 on held-out data than the best per-dataset tuned thresholds, or if the generalization diagnostics (leave-one-dataset-out, paraphrase invariance, threshold transferability) consistently fail their quantitative thresholds, the claim that the harness yields more reliable comparisons would be falsified.

Figures

Figures reproduced from arXiv: 2606.02959 by Marcus Sousa, Ryle Goehausen.

Figure 1
Figure 1. Figure 1: Per-fold ROC overlay for the headline 5-fold cross [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Adversarial-validation AUC per fold (target [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-fold train vs OOF F1. 2.4 Per-chunk to per-prompt aggregation When the pipeline chunks long inputs, training operates per chunk and aggregates at metric time. Let chunks of parent prompt p be indexed by c ∈ C(p). Continuous probabilities are max-pooled, and the hard label is the single global thresh￾old θ˜ = mediank θ ∗ k (from §2.3) applied to the max-pooled probability: pˆp = max c∈C(p) pˆp,c, yˆp = … view at source ↗
Figure 6
Figure 6. Figure 6: Per-source θ ∗ s at matched FPR vs the global θ op . Each dot is a source. 2.7 Matched-FPR per-dataset comparisons Comparing detectors at the same global threshold can mis￾lead when competitors publish their numbers at very differ￾ent operating points. To remove that confound, every per￾dataset competitor entry that publishes both a primary metric and an FPR is also surfaced with our value re-evaluated at … view at source ↗
Figure 8
Figure 8. Figure 8: Random-label control: predicted hard labels scored [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Leave-one-dataset-out F1 delta from the macro [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Length-bias Pearson correlation between input [PITH_FULL_IMAGE:figures/full_fig_p005_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Top permutation-importance features by held-out [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Pairwise Cohen’s κ between ensemble heads. Other checks (summary only). Cross-source near-duplicate hashing catches prompts that appear in multiple datasets un￾der conflicting labels (the most common form of implicit data-quality leak when stitching public benchmarks), and sur￾faces them for manual relabel; determinism replay diffs two runs with identical seed for byte-equal OOF probabilities; a paraphras… view at source ↗
Figure 12
Figure 12. Figure 12: Reliability diagram: predicted probability bin [PITH_FULL_IMAGE:figures/full_fig_p007_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Per-dataset primary metric with nested bootstrap CIs. [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Metric availability across published competitor [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Global Pareto: F1 vs FPR averaged across each system’s independently verified rows on the benchmarks evaluated [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Detection Error Tradeoff on probit axes. Probit [PITH_FULL_IMAGE:figures/full_fig_p011_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Maximum TPR at each FPR budget. Gate sweeps [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Every (system × dataset) operating point. Gate at natural and at FPR-1%; Lakera Guard in red; second-most￾published competitor in orange; everything else grey. Head-to-head: Lakera Guard Lakera Guard is the most-cited commercial competitor in this space and publishes per-dataset numbers on the broadest set of benchmarks. The figures below report the head-to-head comparison at per-dataset matched FPR so th… view at source ↗
Figure 20
Figure 20. Figure 20: WildGuard-benign: false-positive rate against ev [PITH_FULL_IMAGE:figures/full_fig_p013_20.png] view at source ↗
Figure 19
Figure 19. Figure 19: NotInject: false-positive rate against every pub [PITH_FULL_IMAGE:figures/full_fig_p013_19.png] view at source ↗
Figure 21
Figure 21. Figure 21: Gate vs Lakera Guard: per-dataset values at matched FPR. Bars are the per-dataset primary metric (F1 for balanced [PITH_FULL_IMAGE:figures/full_fig_p014_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Latency density distribution: density histogram [PITH_FULL_IMAGE:figures/full_fig_p014_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: End-to-end detection latency: Gate (blue, p50 [PITH_FULL_IMAGE:figures/full_fig_p015_23.png] view at source ↗
read the original abstract

Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both. The detector under evaluation is scored across 16 public benchmarks (12,111 samples) using 5-fold cross-validation. StratifiedKFold (by row) is the headline pass; a parallel StratifiedGroupKFold pass over a composite key (parent-prompt id plus MinHash + LSH near-duplicate clusters at Jaccard $\gtrsim 0.8$) runs alongside it as a leakage-premium diagnostic. A single global operating point is selected on the held-out folds (max F1 subject to FPR $\leq 1\%$) and applied uniformly to every dataset, so per-dataset results reflect one threshold rather than per-benchmark optimisation. Generalisation is examined through a battery of diagnostics (leave-one-dataset-out cross-validation, a random-label control, adversarial validation, permutation feature importance, length-bias correlation, classifier-head agreement, cross-source near-duplicate detection, threshold transferability, train-vs-OOF agreement, and a paraphrase-invariance probe), most with a quantitative pass threshold and the remainder with a stated failure mode. For every external comparison, the detector's threshold is re-tuned to the competitor's published false-positive rate so head-to-head values are evaluated at matched operating points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce an evaluation harness for prompt-injection and jailbreak detectors that mitigates per-dataset threshold tuning and undisclosed operating points. It scores detectors on 16 public benchmarks (12,111 samples) via 5-fold cross-validation, selects a single global operating point (max F1 subject to FPR ≤ 1%) on held-out folds, and applies it uniformly; StratifiedKFold by row is the headline protocol while StratifiedGroupKFold (parent-prompt + MinHash/LSH clusters at Jaccard ≳ 0.8) serves as a leakage diagnostic. A suite of generalization diagnostics (leave-one-dataset-out, random-label control, adversarial validation, etc.) is applied, most with quantitative pass thresholds, and external comparisons are performed at matched FPRs.

Significance. If the central claim holds, the work would provide a reproducible, standardized protocol for LLM security detector evaluation that reduces the common practice of per-benchmark optimization. The battery of diagnostics with stated quantitative thresholds and the explicit handling of operating-point matching are positive features that could improve comparability across papers.

major comments (1)
  1. [Evaluation Protocol (abstract and § on cross-validation)] The headline protocol uses StratifiedKFold by row for threshold selection and reporting, while the StratifiedGroupKFold (parent-prompt id plus MinHash/LSH near-duplicate clusters) is described only as a parallel diagnostic. Because near-duplicates can cross folds under the row-wise split, the selected global operating point may overfit to leaked examples; this directly undermines the claim that the harness produces a leakage-controlled uniform threshold applicable across the 16 benchmarks.
minor comments (1)
  1. [Abstract] Clarify whether the Jaccard ≳ 0.8 threshold for LSH clustering is fixed or tuned, and report the exact number of clusters formed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and for identifying a key point about our evaluation protocol. We respond to the major comment below and outline the changes we will make.

read point-by-point responses
  1. Referee: [Evaluation Protocol (abstract and § on cross-validation)] The headline protocol uses StratifiedKFold by row for threshold selection and reporting, while the StratifiedGroupKFold (parent-prompt id plus MinHash/LSH near-duplicate clusters) is described only as a parallel diagnostic. Because near-duplicates can cross folds under the row-wise split, the selected global operating point may overfit to leaked examples; this directly undermines the claim that the harness produces a leakage-controlled uniform threshold applicable across the 16 benchmarks.

    Authors: We agree that the distinction between protocols requires clarification to support the leakage-control claim. The manuscript presents StratifiedKFold (by row) as the headline protocol because it follows conventional cross-validation practice and allows direct comparison with prior work, while the composite-key StratifiedGroupKFold serves as an explicit leakage diagnostic. However, the referee correctly notes that near-duplicates may still cross row-wise folds, potentially allowing the global threshold (max F1 at FPR ≤ 1%) to benefit from leakage. To strengthen the central claim, we will revise the abstract and the cross-validation section to designate the StratifiedGroupKFold results as the primary, leakage-controlled protocol for threshold selection and reporting. The row-wise results will be retained as a secondary comparison to quantify the leakage premium. This change directly addresses the risk of overfitting to leaked examples while preserving the uniform-threshold objective. revision: yes

Circularity Check

0 steps flagged

No significant circularity; methodology is self-contained

full rationale

The paper describes an evaluation harness using 5-fold cross-validation (StratifiedKFold headline, StratifiedGroupKFold diagnostic) to select a single global operating point (max F1 at FPR ≤1% on held-out folds) applied uniformly across 16 benchmarks. No equations, fitted parameters, or derivations are presented that reduce the claimed generalization or threshold selection to inputs by construction. No self-citations are load-bearing for uniqueness or ansatz, and the method is proposed independently without renaming known results or smuggling assumptions via citation. The central claim rests on the described procedure itself rather than reducing to its own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The methodology rests on standard cross-validation assumptions plus domain-specific choices for clustering and threshold selection; no new physical entities are introduced.

free parameters (2)
  • FPR cap = 1%
    Maximum false-positive rate of 1% used to select the operating point
  • Jaccard threshold = 0.8
    Similarity cutoff for defining near-duplicate clusters in the group fold
axioms (2)
  • standard math StratifiedKFold yields unbiased performance estimates when applied to the composite benchmark collection
    Invoked by the headline 5-fold pass
  • domain assumption The 16 public benchmarks collectively represent the distribution of prompt-injection attacks
    Required for the claim that uniform-threshold results generalize

pith-pipeline@v0.9.1-grok · 5777 in / 1417 out tokens · 26145 ms · 2026-06-28T15:03:18.394954+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Palit, D

    S. Palit, D. Woods. Evaluating the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset. arXiv:2505.13028. 2025.https://arxiv.org/abs/2505.13028

  2. [2]

    Datta, S

    Y . Datta, S. Rajasekar. JavelinGuard: Low-Cost Transformer Architectures for LLM Security. arXiv:2506.07330. 2025. https://arxiv.org/abs/2506.07330

  3. [3]

    F. A. Chitan. ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems. arXiv:2603.13247. 2026. https://arxiv.org/abs/2603.13247 16

  4. [4]

    V . García. Which firewall best prevents prompt injection attacks? NeuralTrust blog. 2025.https://neuraltrust. ai/blog/prevent-prompt-injection-attacks-firewall-comparison

  5. [5]

    deepset/prompt-injections (community-labelled prompt-injection dataset)

    deepset. deepset/prompt-injections (community-labelled prompt-injection dataset). Hugging Face Datasets. 2023. https://huggingface.co/datasets/deepset/prompt-injections

  6. [6]

    prompt-injection-jailbreak-sentinel-v2 (model card)

    Rogue Security. prompt-injection-jailbreak-sentinel-v2 (model card). Hugging Face. 2025. https://huggingface. co/rogue-security/prompt-injection-jailbreak-sentinel-v2

  7. [7]

    Schulhoff et al

    S. Schulhoff et al. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. EMNLP 2023; project site. 2023.https://www.hackaprompt.com

  8. [8]

    jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset)

    jackhhao. jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset). Hugging Face Datasets. 2023.https://huggingface.co/datasets/jackhhao/jailbreak-classification

  9. [9]

    Abdelnabi et al

    S. Abdelnabi et al. LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge. arXiv:2506.09956. 2025.https://arxiv.org/abs/2506.09956

  10. [10]

    I. Wu, M. Maslowski. CourtGuard: A Local, Multiagent Prompt Injection Classifier. arXiv:2510.19844. 2025. https: //arxiv.org/abs/2510.19844

  11. [11]

    H. Li, X. Liu. InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models. arXiv:2410.22770. 2024.https://arxiv.org/abs/2410.22770

  12. [12]

    L. E. Erdogan et al. safe-guard-prompt-injection (synthetic prompt-injection dataset, n=10,296). Hugging Face Datasets. 2024.https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection

  13. [13]

    Kasundra et al

    J. Kasundra et al. AprielGuard. arXiv:2512.20293. 2025.https://arxiv.org/abs/2512.20293

  14. [14]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. 2023. https://arxiv.org/abs/2307.15043

  15. [15]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    A. Robey et al. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684. 2023. https://arxiv.org/abs/2310.03684

  16. [16]

    Benchmarking and defending against indi- rect prompt injection attacks on large language models

    J. Yi et al. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. arXiv:2312.14197. 2023.https://arxiv.org/abs/2312.14197

  17. [17]

    Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset)

    Lakera AI. Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset). Hugging Face Datasets. 2023. https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions

  18. [18]

    Li et al

    R. Li et al. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks. arXiv:2409.19521. 2024.https://arxiv.org/abs/2409.19521

  19. [19]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    M. Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249. 2024.https://arxiv.org/abs/2402.04249

  20. [20]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

    S. Han et al. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. arXiv:2406.18495. 2024.https://arxiv.org/abs/2406.18495

  21. [21]

    arXiv preprint arXiv:2402.05044 , year=

    L. Li et al. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. arXiv:2402.05044. 2024.https://arxiv.org/abs/2402.05044

  22. [22]

    G. C. Cawley, N. L. C. Talbot. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research 11. 2010. https://jmlr.org/papers/v11/cawley10a. html 17