
arxiv: 2605.19966 · v1 · pith:DV4KRPXPnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Pith reviewed 2026-05-20 07:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords adversarial prompt detection · change point detection · LLM jailbreak · next-token entropy · CUSUM · online detection · adversarial suffix · model safety

The pith

Change-point detection on next-token entropy identifies fluent optimization-based adversarial suffixes in LLM prompts online and without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that optimization-based adversarial suffixes used to jailbreak LLMs create measurable shifts in the model's sequence of next-token entropies even when the suffixes remain fluent. By estimating a baseline from the system prompt, standardizing the entropy values for user tokens, and monitoring them with a one-sided cumulative sum statistic, the approach frames detection as an online change-point problem. The resulting CPD method requires no training, runs in real time, works across models, and improves detection F1 scores over windowed perplexity baselines while also pinpointing where the adversarial suffix begins. In high-volume settings dominated by benign prompts, it further reduces the number of calls to a heavier safety model without hurting overall detection quality.
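
The pipeline in the paragraph above can be written compactly. What follows is a sketch in the standard Page-CUSUM form, not a transcription of the paper's equations: the symbols $H_t$, $\mu_0$, $\sigma_0$, $z_t$, and $\tau$ are notation assumed here, the alarm rule ($\ge$ rather than $>$) is an assumption, and the paper's exact baseline estimator (described only as "robust") is not specified in this summary.

```latex
% p_t(v): the model's next-token distribution at position t over vocabulary V
H_t = -\sum_{v \in V} p_t(v) \log p_t(v)

% standardize user-token entropies against the system-prompt baseline (\mu_0, \sigma_0)
z_t = \frac{H_t - \mu_0}{\sigma_0}

% one-sided CUSUM with slack k and threshold h
W_0^+ = 0, \qquad W_t^+ = \max\bigl(0,\; W_{t-1}^+ + z_t - k\bigr)

% alarm time: first user token at which the statistic reaches the threshold
\tau = \min\{\, t : W_t^+ \ge h \,\}
```

Because $W_t^+$ resets at zero, isolated high-entropy tokens decay away, while a sustained upward shift accumulates until it crosses $h$.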

Core claim

The paper claims that casting adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream, standardized against a system-prompt baseline and tracked with a one-sided CUSUM statistic, yields a model-agnostic, training-free detector. That detector raises F1 above the strongest windowed-perplexity baseline on all six tested open-weight chat models, reaches AUROC 0.88 and F1 0.82 on LLaMA-2-7B, places 79.6 percent of its triggers inside the adversarial suffix, and cuts LLaMA Guard calls by 17-22 percent in benign-dominated deployment while preserving guard-level quality.

What carries the argument

A one-sided CUSUM statistic applied to next-token entropies standardized against the system-prompt baseline.
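
A minimal Python sketch of that statistic, assuming the input `z` is already the standardized entropy stream. The function name `cusum_first_alarm` and the default threshold `h=5.0` are hypothetical choices for illustration; only the slack `k = 0` is taken from the paper's canonical setting.

```python
from typing import Optional, Sequence


def cusum_first_alarm(z: Sequence[float], k: float = 0.0, h: float = 5.0) -> Optional[int]:
    """Return the index of the first token where the one-sided CUSUM reaches h, else None."""
    w = 0.0
    for t, z_t in enumerate(z):
        # Accumulate positive drift only; the max(0, .) reset forgets negative excursions.
        w = max(0.0, w + z_t - k)
        if w >= h:
            return t
    return None


# Toy standardized streams (made-up numbers, not data from the paper).
benign = [0.2, -0.4, 0.1, -0.3, 0.3, -0.2]
attacked = benign + [1.8, 2.1, 1.6, 2.4]  # sustained upward shift starts at index 6

print(cusum_first_alarm(benign))
print(cusum_first_alarm(attacked))
```

On the toy data the alarm lands a couple of tokens after the shift begins, which is the usual detection delay of a cumulative statistic.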

If this is right

  • F1 improves over the strongest windowed-perplexity baseline on LLaMA-2-7B/13B, Vicuna-7B/13B, and Qwen2.5-7B/14B.
  • 79.6 percent of CPD triggers fall inside the adversarial suffix, compared with 17-46 percent for the baselines.
  • When used as a gate for LLaMA Guard in benign-dominated traffic, the detector cuts guard calls by 17-22 percent while keeping detection performance intact.
  • The method localizes the onset of the adversarial suffix in addition to flagging the full prompt.
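
The localization figures above depend on how a trigger is assigned to a region of the prompt. The sketch below is one assumed reading of the four locality labels named in the Figure 2 caption (before, before+in, in-suffix, in-benign); the paper's exact definitions are not given on this page, and the function name and inclusive-index convention are hypothetical.

```python
from typing import Optional


def locality(trigger_start: int, trigger_end: int, suffix_onset: Optional[int]) -> str:
    """Label a trigger span [trigger_start, trigger_end] (inclusive token indices).

    suffix_onset is the index of the first adversarial-suffix token,
    or None when the prompt is benign and has no suffix.
    """
    if trigger_end < trigger_start:
        raise ValueError("trigger_end must be >= trigger_start")
    if suffix_onset is None:
        return "in-benign"
    if trigger_end < suffix_onset:
        return "before"
    if trigger_start >= suffix_onset:
        return "in-suffix"
    return "before+in"  # span straddles the suffix boundary


# A single-token CUSUM alarm is a span of length one; a perplexity window is wider.
print(locality(12, 12, suffix_onset=10))
print(locality(5, 14, suffix_onset=10))
```

Under this reading a single-token alarm can never straddle the boundary, whereas a fixed window can, which is one plausible reason window-based detectors score lower on in-suffix share.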

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-based change-point approach could be tested on prompt streams that mix multiple users or sessions to monitor for emerging anomalies beyond single-prompt attacks.
  • Localization of the suffix onset suggests the possibility of early intervention, such as truncating or rewriting the prompt at the moment the change is flagged rather than after the full input is seen.
  • The training-free and model-agnostic properties make it straightforward to combine with other lightweight signals for layered defenses that escalate only when the CUSUM fires.
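
The early-intervention idea in the list above can be sketched as follows. This is an illustration of the editorial extension, not something the paper evaluates; `truncate_at_alarm` and its default `h` are hypothetical.

```python
from typing import List, Sequence


def truncate_at_alarm(tokens: Sequence[str], z: Sequence[float],
                      k: float = 0.0, h: float = 5.0) -> List[str]:
    """Keep only the tokens seen before the one-sided CUSUM first reaches h."""
    if len(tokens) != len(z):
        raise ValueError("tokens and z must have the same length")
    w = 0.0
    for t, z_t in enumerate(z):
        w = max(0.0, w + z_t - k)
        if w >= h:
            return list(tokens[:t])  # drop the alarm token and everything after it
    return list(tokens)  # no alarm: prompt passes through unchanged


tokens = ["a", "b", "c", "d", "e"]
z = [0.0, 0.1, 3.0, 3.0, 3.0]  # toy values; the shift starts at index 2
print(truncate_at_alarm(tokens, z))
```

In the toy run the alarm fires one token after the shift begins, so the first shifted token survives truncation. Any real intervention would need to account for that detection delay, for example by backing off a few tokens from the alarm.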

Load-bearing premise

Next-token entropy shifts induced by optimization-based suffixes remain sufficiently distinct from natural entropy variation in benign prompts after standardization against the system-prompt baseline.

What would settle it

A collection of benign prompts whose standardized entropy sequences trigger the one-sided CUSUM at rates comparable to those of the optimization-based adversarial suffixes, or a set of optimization-based suffixes that produce no detectable entropy shift relative to the baseline.
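
The test described above amounts to comparing trigger rates across two collections of standardized entropy streams at a fixed threshold. A minimal sketch, with hypothetical helper names and made-up toy streams:

```python
from typing import Sequence


def peak_cusum(z: Sequence[float], k: float = 0.0) -> float:
    """Largest value the one-sided CUSUM statistic attains over the stream."""
    w = peak = 0.0
    for z_t in z:
        w = max(0.0, w + z_t - k)
        peak = max(peak, w)
    return peak


def trigger_rate(streams: Sequence[Sequence[float]], h: float, k: float = 0.0) -> float:
    """Fraction of prompts whose CUSUM peak reaches the threshold h."""
    if not streams:
        raise ValueError("need at least one stream")
    return sum(peak_cusum(z, k) >= h for z in streams) / len(streams)


# Toy data only; real streams would come from the target model's entropies.
benign_streams = [[0.1, -0.2, 0.3], [0.5, 0.5, -1.0]]
adversarial_streams = [[2.0, 2.0, 2.0], [0.1, 3.0, 3.0]]

print(trigger_rate(benign_streams, h=5.0))
print(trigger_rate(adversarial_streams, h=5.0))
```

If the two rates came out close on realistic benign traffic, the load-bearing premise would be in trouble; a wide gap supports it.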

Figures

Figures reproduced from arXiv: 2605.19966 by Miguel R. D. Rodrigues, Mohammed Alshaalan.

Figure 1: Top: benign prompt where the CUSUM statistic $W_t^+$ (purple) stays below threshold h (orange) at slack k = 0 (the canonical Page-CUSUM setting used for …
Figure 2: Locality breakdown for LLaMA-2-7B: distribution of triggers across regions before the suffix, before+in (boundary-straddling), in-suffix, and in-benign. Left: F1-optimal threshold. Right: FPR@10% on benign prompts. For readability we plot CPD Online and representative WPP windows (WPP5, WPP20); the full sweep over w ∈ {5, 10, 15, 20} is in Appendix B.1.
Figure 3: A second AdvPrompter example to reinforce …
Figure 4: Aggregate CPD Online CUSUM trajectories by prompt type (LLaMA-2-7B benchmark). Top: k = 0. Bottom: k = 0.5. Curves show the median $W_t^+$ across prompts with an interquartile band. For adversarial prompts, the green dashed line marks the aligned suffix onset; the orange dashed line marks the threshold h used for visualization (a representative pooled F1-optimal value; main-paper …
Figure 5: Locality breakdown for LLaMA-2-7B at the F1-optimal threshold for CPD Online and WPP with w ∈ {5, 10, 15, 20}. Bars are grouped by locality category (Before / Before+In / In-suffix / In-benign) and are decomposed by attack family using hatch fill within each method's bar. [Plot: x-axis categories Before suffix, Before + In suffix, In suffix, In benign; y-axis Share of triggers (%).]
Figure 6: Locality breakdown for LLaMA-2-7B at FPR@10% on benign prompts for CPD Online and WPP with w ∈ {5, 10, 15, 20}. Bars are grouped by locality category (Before / Before+In / In-suffix / In-benign) and are decomposed by attack family using hatch fill within each method's bar.
… all six base LLMs at the cost of moving away from the standard formulation; we report it as a sensitivity variant rather than a main re…
Figure 7: Compares CPD ROC curves on the main LLaMA-2-7B benchmark across all three slack settings k ∈ {−0.5, 0, 0.5} (Section 2.3). Each plot also reports subset curves that exclude (no_gcg) or isolate (gcg_only) GCG attacks. [Plot: ROC (adv vs benign), cpd_online; x-axis False Positive Rate, y-axis True Positive Rate; legend of the first panel: overall (AUC=0.894), no_gcg (AUC=0.889), gcg_only (AUC=0.916); remaining panels …]
Figure 8: CPD Online ROC curves under PP-gap sampling for α = 1 (top-left), α = 2 (top-right), and α = 3 (bottom), each showing overall / no_gcg / gcg_only subsets.
… CPD's discriminative power on this controlled benchmark. Why WPP degrades. WPP thresholds on mean NLL within non-overlapping windows. At α = 3, the benign PP distribution is deliberately shifted above the adversarial PP distribution, producing substant…
Abstract

Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting ($k=0$), CPD reaches AUROC $0.88$ and F1 $0.82$. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality.
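
The "token-level next-token entropy stream" in the abstract is the Shannon entropy of the model's predictive distribution at each position. A minimal pure-Python sketch, assuming the raw logits for each position are available as a list of floats; the function names are hypothetical, and a real implementation would operate on model tensors rather than Python lists.

```python
import math
from typing import List, Sequence


def next_token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_stream(logit_rows: Sequence[Sequence[float]]) -> List[float]:
    """One entropy value per token position."""
    return [next_token_entropy(row) for row in logit_rows]


# A uniform distribution over 4 tokens has entropy log(4); a sharply peaked one is near 0.
print(entropy_stream([[0.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]))
```

Entropy depends on the whole predictive distribution, whereas perplexity depends only on the probability of the token actually observed, so the two signals can diverge on fluent text.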

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CPD Online (CPD), a training-free, model-agnostic online detector for optimization-based adversarial suffixes. It frames detection as a change-point problem on the stream of next-token entropies: user-token entropies are standardized against mean and standard deviation computed from the system-prompt tokens, then fed to a one-sided CUSUM statistic. On a benchmark of 1,012 adversarial suffixes (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts across six models, CPD reports AUROC 0.88 and F1 0.82 on LLaMA-2-7B at k=0, 79.6% trigger localization inside the suffix, and 17-22% reduction in LLaMA Guard calls while preserving detection quality.

Significance. If the central claims hold, the work supplies a lightweight, online, training-free alternative to perplexity-based detectors that is particularly effective against fluent optimization-based attacks. The localization of the adversarial suffix and the demonstrated efficiency gain when gating a heavier guard model are practical strengths. The evaluation on held-out attack and benign sets with multiple models and attack families provides a reproducible foundation for the performance numbers.

major comments (2)
  1. [Section 3.2] Section 3.2 (standardization and CUSUM): The procedure subtracts the system-prompt mean and divides by its standard deviation before applying one-sided CUSUM. The reported AUROC 0.88 / F1 0.82 and 79.6% localization on LLaMA-2-7B rest on the premise that perplexity-controlled benign user prompts produce no sustained positive drift relative to the system-prompt baseline. The manuscript does not report an explicit test of this stationarity assumption on uncontrolled benign prompts that vary in length, topic, or syntactic complexity; if such prompts induce drift, CUSUM will accumulate false positives and the claimed separation from windowed-perplexity baselines (17-46% localization) would not hold.
  2. [Section 4.3] Section 4.3 and Table 2: The canonical setting k=0 is used and the decision threshold h appears fixed for the reported metrics, yet no sensitivity table or description of how h was selected without reference to the test-set labels is provided. Because the central performance numbers (AUROC, F1, localization percentage) are load-bearing for the claim of superiority, the absence of this detail makes it impossible to judge whether the results are robust to reasonable choices of h.
minor comments (2)
  1. [Abstract] The abstract states results 'at the canonical CUSUM setting (k=0)' but does not define the exact decision threshold or confirm it was chosen independently of the test set; adding one sentence would improve reproducibility.
  2. [Figure 3] Figure 3 (localization plot): The adversarial-suffix region is not explicitly shaded or labeled on the x-axis, making it harder to verify the 79.6% concentration claim at a glance.
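
The drift concern in major comment 1 can be made concrete with a short bound. This is a sketch using the standard max-of-partial-sums form of one-sided CUSUM; the symbols $z_i$ (standardized user-token entropy) and $\delta$ (its mean on benign prompts), and the numbers in the last line, are illustrative assumptions rather than values from the paper.

```latex
% Unrolling W_t^+ = \max(0, W_{t-1}^+ + z_t - k) with W_0^+ = 0:
W_t^+ = \max_{0 \le s \le t} \sum_{i=s+1}^{t} (z_i - k)
      \;\ge\; \sum_{i=1}^{t} (z_i - k)

% If benign user tokens have \mathbb{E}[z_i] = \delta > k:
\mathbb{E}[W_t^+] \;\ge\; (\delta - k)\, t
\quad\Longrightarrow\quad
\mathbb{E}[W_t^+] \ge h \ \text{ once } \ t \ge \frac{h}{\delta - k}

% Illustrative numbers only: k = 0, \delta = 0.25, h = 5 \;\Rightarrow\; t \ge 20 \text{ tokens}
```

Under these assumptions even a small positive offset between user-token and system-prompt entropy would push long benign prompts over the threshold, which is why a test on length-varied benign prompts matters.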

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the presentation without altering the core claims.

Point-by-point responses
  1. Referee: [Section 3.2] Section 3.2 (standardization and CUSUM): The procedure subtracts the system-prompt mean and divides by its standard deviation before applying one-sided CUSUM. The reported AUROC 0.88 / F1 0.82 and 79.6% localization on LLaMA-2-7B rest on the premise that perplexity-controlled benign user prompts produce no sustained positive drift relative to the system-prompt baseline. The manuscript does not report an explicit test of this stationarity assumption on uncontrolled benign prompts that vary in length, topic, or syntactic complexity; if such prompts induce drift, CUSUM will accumulate false positives and the claimed separation from windowed-perplexity baselines (17-46% localization) would not hold.

    Authors: We thank the referee for identifying this assumption. The perplexity-controlled benign prompts were chosen specifically to enable a fair, apples-to-apples comparison with windowed-perplexity baselines under matched fluency conditions; uncontrolled prompts would introduce length and topic confounds that affect all detectors equally. We nevertheless agree that an explicit check on uncontrolled benign prompts would better substantiate stationarity. In the revised manuscript we will add a new experiment using a held-out collection of uncontrolled benign prompts that vary in length, topic, and syntactic complexity, reporting the resulting false-positive rate, AUROC, and localization statistics to quantify any drift. revision: yes

  2. Referee: [Section 4.3] Section 4.3 and Table 2: The canonical setting k=0 is used and the decision threshold h appears fixed for the reported metrics, yet no sensitivity table or description of how h was selected without reference to the test-set labels is provided. Because the central performance numbers (AUROC, F1, localization percentage) are load-bearing for the claim of superiority, the absence of this detail makes it impossible to judge whether the results are robust to reasonable choices of h.

    Authors: We apologize for the omitted detail. The threshold h was selected on a validation split held completely separate from the test set by performing a grid search that maximized F1 on the validation data. To address the concern directly, the revised manuscript will include a sensitivity table in Section 4.3 that reports AUROC, F1, and localization accuracy for a range of h values (e.g., 3 to 10) around the chosen operating point, confirming that the reported gains over baselines remain stable across reasonable threshold choices. revision: yes
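
The selection procedure the simulated rebuttal describes, a grid search over h that maximizes F1 on a validation split, can be sketched as below. The helper names, the tie-breaking rule, and the toy scores are assumptions; the prompt-level score is taken here to be the peak CUSUM value of each prompt, which this page does not confirm.

```python
from typing import Sequence, Tuple


def f1_at(scores: Sequence[float], labels: Sequence[int], h: float) -> float:
    """F1 when a prompt is flagged iff its score reaches h (label 1 = adversarial)."""
    tp = sum(s >= h and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= h and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < h and y == 1 for s, y in zip(scores, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(val_scores: Sequence[float], val_labels: Sequence[int],
                     grid: Sequence[float]) -> Tuple[float, float]:
    """Return (h, F1) with the best validation F1; ties keep the smallest h."""
    if not grid:
        raise ValueError("grid must not be empty")
    best_h, best_f1 = None, -1.0
    for h in sorted(grid):
        f1 = f1_at(val_scores, val_labels, h)
        if f1 > best_f1:  # strict inequality keeps the first (smallest) h on ties
            best_h, best_f1 = h, f1
    return best_h, best_f1


# Toy validation data (not from the paper).
val_scores = [1.0, 2.0, 6.0, 8.0, 4.5, 9.0]
val_labels = [0, 0, 1, 1, 0, 1]
print(select_threshold(val_scores, val_labels, grid=[3, 4, 5, 6, 7]))
```

Reporting `f1_at` on the test split for every grid value, not only the selected one, would give the sensitivity table the referee asks for.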

Circularity Check

0 steps flagged

No significant circularity; standard CUSUM on held-out entropy streams

full rationale

The paper frames detection as online change-point detection via one-sided CUSUM applied to next-token entropy, standardized against system-prompt statistics. All reported metrics (AUROC 0.88, F1 0.82, 79.6% suffix localization) are obtained from empirical evaluation on explicitly held-out sets of 1,012 optimization-based attacks and 1,012 perplexity-controlled benign prompts across six models. No equation reduces these performance numbers to a fit performed on the same test data, nor does any step equate a claimed result to its own inputs by definition. The approach is training-free, uses a canonical CUSUM parameter (k=0), and relies on externally computed entropy values rather than self-referential fitting or self-citation chains. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that entropy provides a stable signal for suffix onset and on one tunable CUSUM parameter whose value is presented as canonical.

free parameters (1)
  • CUSUM reference value k
    Chosen as the canonical setting k=0 for the main reported results on LLaMA-2-7B.
axioms (1)
  • domain assumption Next-token entropy computed by the target LLM is a reliable and comparable signal across tokens when standardized against the system prompt.
    Invoked to create the standardized stream on which CUSUM operates.


