Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
Pith reviewed 2026-05-10 11:05 UTC · model grok-4.3
The pith
Requiring multiple consistent evidence tokens in streaming probes raises the true-positive rate for harmful LLM intent by 35.55 percent, relative to strong streaming baselines, at a fixed 1 percent false-positive rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A streaming probing objective that demands multiple evidence tokens agree on a harmful-intent prediction, rather than accepting isolated high scores, produces substantially fewer false positives on benign mentions of CBRN terms while raising the true-positive rate by 35.55 percent at a fixed 1 percent false-positive rate and lifting AUROC even from already-high baselines. Attention and MLP activations outperform residual-stream features, and the resulting probes transfer plug-and-play to character-level cipher attacks with AUROC above 98.85 percent.
What carries the argument
Segment-level coherence objective that aggregates activation signals across multiple tokens instead of relying on single-token spikes.
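The paper does not spell out the aggregation rule here, but the idea can be sketched with a hypothetical top-k average: a lone high-scoring token is diluted by its low-scoring neighbors, while several consistent evidence tokens keep the segment score high. The threshold and k below are illustrative, not the authors' exact objective.

```python
import numpy as np

def coherence_score(token_scores, k=3):
    """Aggregate per-token probe scores into a segment-level score.

    Hypothetical sketch: instead of flagging on the single highest
    token score (max-pooling), average the top-k token scores so a
    prediction needs several agreeing evidence tokens.
    """
    s = np.sort(np.asarray(token_scores, dtype=float))[::-1]
    return float(s[:k].mean())

# A benign mention with one spiky token vs. a harmful prompt with
# several moderately high evidence tokens:
benign = [0.95, 0.10, 0.05, 0.08]   # isolated spike on a CBRN term
harmful = [0.80, 0.75, 0.70, 0.20]  # consistent evidence tokens
print(coherence_score(benign))   # spike diluted by low neighbors
print(coherence_score(harmful))  # stays high
```

Note that max-pooling would rank the benign example *above* the harmful one (0.95 vs. 0.80), which is exactly the single-token false-alarm mode the paper targets.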
If this is right
- At fixed 1 percent false-positive rate the true-positive rate rises 35.55 percent relative to strong streaming baselines.
- AUROC improves even when baseline performance is already near 97.40 percent.
- Probes on attention or MLP activations outperform those on residual-stream features.
- Probes trained on base models transfer directly to adversarially fine-tuned models that use novel ciphers, reaching over 98.85 percent AUROC.
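The headline metric in the first bullet, TPR at a fixed FPR, can be computed as follows; this is a generic sketch of the evaluation, not code from the paper:

```python
import numpy as np

def tpr_at_fpr(scores_pos, scores_neg, target_fpr=0.01):
    """True-positive rate at a fixed false-positive rate.

    Choose the score threshold at the (1 - target_fpr) quantile of
    benign (negative) scores, so at most target_fpr of negatives are
    flagged, then measure the fraction of harmful (positive) scores
    above that threshold.
    """
    neg = np.asarray(scores_neg, dtype=float)
    thresh = np.quantile(neg, 1.0 - target_fpr)
    return float(np.mean(np.asarray(scores_pos, dtype=float) > thresh))
```

A 35.55 percent relative gain on this metric at 1 percent FPR means substantially more harmful prompts are caught while the benign false-alarm budget stays unchanged.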
Where Pith is reading between the lines
- The same coherence filter could be applied to other real-time safety monitors that currently suffer single-token false alarms.
- If intent is reliably distributed, future work could test whether shorter segments still suffice or whether longer windows become necessary.
- The plug-and-play result suggests that harmful intent leaves detectable traces even after surface-level obfuscation, which could be checked on additional cipher families.
Load-bearing premise
Harmful intent always produces several consistent evidence tokens rather than a single strong cue or fully distributed signal.
What would settle it
A test set of harmful prompts whose intent is carried by one dominant token or by non-coherent activations; if the new probe misses them while a single-token baseline catches them, the coherence claim is falsified.
Figures
Original abstract
Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a segment-level coherence objective for streaming probes that detect harmful CBRN intent in LLMs. Instead of relying on isolated high-scoring tokens, the method requires multiple evidence tokens to provide consistent support for a harmful-intent prediction. The authors report a 35.55% relative TPR improvement at a fixed 1% FPR versus strong streaming baselines, AUROC gains even from a near-saturated baseline of 97.40%, superior performance when probing attention or MLP activations rather than residual-stream features, and plug-and-play transfer of base-model probes to adversarially fine-tuned models that use novel character-level ciphers, yielding AUROC > 98.85%.
Significance. If the reported gains are reproducible and do not trade off detection of subtle or single-cue harmful intents, the work offers a practical advance for real-time safety monitoring of LLMs in high-stakes domains. The plug-and-play robustness to obfuscated attacks is a notable strength that could reduce the need for per-model retraining. The emphasis on aggregated multi-token evidence rather than single-token spikes addresses a known failure mode of current streaming probes.
Major comments (2)
- [Abstract] Abstract: the central claim of a 35.55% relative TPR improvement at fixed 1% FPR rests on the assumption that harmful intents reliably produce multiple consistent evidence tokens. The skeptic note correctly identifies that no per-subset breakdown or ablation on cue concentration (single high-signal token vs. distributed weak cues) is provided, so it remains possible that the coherence filter increases false negatives on a non-negligible fraction of attacks; this directly affects whether the reported gains generalize.
- [Abstract] Abstract / Results: the plug-and-play AUROC > 98.85% on adversarially fine-tuned ciphers is load-bearing for the robustness claim, yet the abstract supplies no information on probe training details, whether the same test distribution was used, or statistical significance of the gains; without these, the improvement cannot be assessed as a general property of the coherence objective rather than an artifact of the evaluated cases.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of generalizability and clarity. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] Abstract: the central claim of a 35.55% relative TPR improvement at fixed 1% FPR rests on the assumption that harmful intents reliably produce multiple consistent evidence tokens. The skeptic note correctly identifies that no per-subset breakdown or ablation on cue concentration (single high-signal token vs. distributed weak cues) is provided, so it remains possible that the coherence filter increases false negatives on a non-negligible fraction of attacks; this directly affects whether the reported gains generalize.
Authors: We agree that the current manuscript lacks an explicit ablation on cue concentration, which limits assessment of whether gains hold for single-cue attacks. In the revised version, we will add a new results subsection with a per-subset breakdown: we stratify the test set by number of high-signal evidence tokens (single vs. distributed cues) and report TPR at 1% FPR for each stratum. This will quantify any potential increase in false negatives on concentrated-cue cases and directly test the generalizability of the coherence objective. Revision: yes.
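The stratification the authors propose can be sketched as follows. The high-signal threshold and the single/distributed split are hypothetical illustrative choices, not values from the paper:

```python
import numpy as np

def cue_count(token_scores, high=0.7):
    """Number of high-signal evidence tokens in one prompt (illustrative threshold)."""
    return int(np.sum(np.asarray(token_scores, dtype=float) >= high))

def stratified_breakdown(prompt_token_scores, labels, high=0.7):
    """Split harmful (label 1) prompts into single-cue vs. distributed-cue
    strata by counting high-signal tokens, so TPR at fixed FPR can then be
    reported separately for each stratum."""
    strata = {"single-cue": [], "distributed": []}
    for scores, y in zip(prompt_token_scores, labels):
        if y != 1:
            continue  # only harmful prompts are stratified
        key = "single-cue" if cue_count(scores, high) <= 1 else "distributed"
        strata[key].append(scores)
    return strata
```

If the coherence probe's TPR drops sharply on the single-cue stratum relative to a single-token baseline, that would confirm the referee's concern.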
Referee: [Abstract] Abstract / Results: the plug-and-play AUROC > 98.85% on adversarially fine-tuned ciphers is load-bearing for the robustness claim, yet the abstract supplies no information on probe training details, whether the same test distribution was used, or statistical significance of the gains; without these, the improvement cannot be assessed as a general property of the coherence objective rather than an artifact of the evaluated cases.
Authors: We acknowledge that the abstract is currently too terse on these points. We will revise the abstract to state that probes were trained on base-model activations from the standard training split, applied plug-and-play to the identical obfuscated test distribution, and that the AUROC gains are statistically significant (p < 0.01 via paired bootstrap). Full training hyperparameters, dataset splits, and significance testing procedures are already detailed in Sections 3.2 and 4.3; the abstract update will make these facts immediately visible while respecting length constraints. Revision: yes.
Circularity Check
No significant circularity: empirical gains from the new coherence objective rest on external baselines.
Full rationale
The paper defines a segment-level coherence objective that aggregates multiple evidence tokens rather than isolated spikes, then reports empirical improvements (TPR +35.55% at 1% FPR, AUROC gains, attention/MLP superiority, and plug-and-play cipher robustness) against strong streaming baselines. No equations or metrics are shown to reduce to the inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing claims depend on self-citations or uniqueness theorems from the same authors. The derivation chain is self-contained: the method is introduced by explicit design choice, evaluated on held-out test distributions, and compared to independent baselines. The skeptic concern about missing subtle single-cue intents is a question of coverage and test-set composition, not circularity in the reported results.