What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Michele Papucci; Mina Rezaei; Shireen Kudukkil Manchingal; Sofiia Nikolenko

arxiv: 2606.25182 · v1 · pith:X5UAB5UQnew · submitted 2026-06-23 · 💻 cs.CL · cs.AI· cs.LG

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Sofiia Nikolenko , Michele Papucci , Mina Rezaei , Shireen Kudukkil Manchingal This is my paper

Pith reviewed 2026-06-25 23:27 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords jailbreak detectionentropy dynamicsintermediate layersLLM representationspredictive entropylogit lensadversarial prompts

0 comments

The pith

Jailbreak prompts produce distinct entropy evolution patterns in the middle layers of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether harmful intent in prompts leaves detectable traces inside an LLM rather than only at the final output. It measures token-by-token predictive entropy at every layer using the logit lens and compares static summary numbers against measures that track how entropy rises or falls across the prompt. Static numbers such as mean or variance give almost no separation, while trend-based features do separate jailbreaks from ordinary prompts. The separation is strongest in intermediate layers and weakens near the output head, and the pattern holds across Llama, Qwen, and Gemma without any retraining.

Core claim

Jailbreak behavior is reflected in structured intermediate uncertainty dynamics. Features that capture how predictive entropy evolves across token positions, such as monotonic rank-based trend scores, carry substantially more signal than static aggregate statistics. This signal concentrates in mid-network representations and degrades at the final layer, providing architecture-consistent separation on adversarial benchmarks without additional training.

What carries the argument

Token-level predictive entropy trajectories across layers analyzed with the logit lens, using monotonic rank-based trend scores to quantify evolution across token positions.

If this is right

Static entropy statistics such as mean and variance carry little discriminative signal for jailbreaks.
Features that track entropy change across token positions are substantially more informative.
The discriminative signal concentrates in intermediate layers and weakens at the final layer.
The same entropy-dynamics features separate jailbreaks across multiple model families without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Detection systems could monitor only a subset of intermediate layers for lower latency.
Safety training may leave harmful-intent signals intact in the middle of the network even when they are suppressed at the output.
The same layer-wise entropy analysis could be tested on other forms of prompt manipulation such as prompt injection or role-play overrides.

Load-bearing premise

The observed separation in entropy dynamics is caused by harmful intent rather than other prompt properties such as length, topic, or syntactic complexity.

What would settle it

Construct prompt pairs matched for length, topic, and syntactic complexity where only one contains harmful intent; if the mid-layer entropy trend scores show no consistent difference, the claim is falsified.

Figures

Figures reproduced from arXiv: 2606.25182 by Michele Papucci, Mina Rezaei, Shireen Kudukkil Manchingal, Sofiia Nikolenko.

**Figure 1.** Figure 1: Token-level predictive entropy trajectories at a representative intermediate layer (L22) of Llama-3.1-8B for safe and adversarial (jailbreak) prompts. While aggregate entropy levels are comparable, the jailbreak prompt exhibits a pronounced monotonic trend in entropy evolution across token positions, motivating the use of trajectorybased features. policy violations, data leakage, or the misuse of connecte… view at source ↗

**Figure 2.** Figure 2: Distribution of dynamic trend features for safe (UltraChat, blue) and harmful (AdvBench, red) prompts on Llama-3.1-8B (n = 100 per class). MWU/KS: Mann–Whitney U and Kolmogorov–Smirnov two-sample test p-values for the safevs-harmful difference shown in each panel. Top: intermediate layer L22 (∼69% depth). Bottom: final layer (L32) [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 3.** Figure 3: Distribution of entropy trend features for safe and harmful prompts (Llama3.1-8B, layer 22, n = 100 per class). 7.2 Layer-Wise Directional AUROC Across Metrics Tables 7–9 report directional AUROC for all metrics at every probe layer, averaged over the 6 primary evaluation pairs (UltraChat and WildJailbreak × {AdvBench, HarmBench, StrongREJECT}). Bold rows mark the focal layer (∼69% depth) used in the ma… view at source ↗

read the original abstract

Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mid-layer entropy trend scores separate jailbreaks on benchmarks across models without training, but the paper gives no sign of controlling for prompt length or complexity so the link to harmful intent stays unproven.

read the letter

The main observation is that monotonic rank-based trends in token entropy across positions carry more signal than simple averages or variances, and that this signal concentrates in intermediate layers before fading at the output. The work shows this pattern on standard adversarial sets for Llama, Qwen, and Gemma using the logit lens on a frozen model.

That consistency across families and the lack of any retraining are the practical points worth noting. It gives a concrete way to monitor internal uncertainty dynamics rather than relying only on prompt filters or output checks.

The gap is the missing controls. Nothing indicates that benign prompts were matched to the jailbreaks on length, syntactic complexity, or topic. Entropy measures frequently respond to those surface properties, so the separation could reflect those covariates instead of intent. The stress-test concern lands directly on the abstract and is not addressed there.

If the full experiments include length-matched or complexity-matched baselines and the separation survives them, the layer-specific claim would be stronger. As written, the attribution to harmful intent rests on an assumption that is not yet tested.

Readers working on internal monitoring or representation-based safety checks would get the most from it. The idea is straightforward enough that a referee could evaluate the controls and the quantitative separation in one pass. It is worth sending out for review rather than desk rejecting.

Referee Report

2 major / 2 minor

Summary. The paper claims that token-level predictive entropy trajectories, computed via the logit lens on frozen LLMs, yield discriminative features for jailbreak detection. Static aggregates (mean, variance) carry little signal, but monotonic rank-based trend scores tracking entropy evolution across token positions are substantially more informative. This signal concentrates in intermediate layers and degrades at the final layer, providing architecture-consistent separation across Llama, Qwen, and Gemma on adversarial benchmarks without any additional training.

Significance. If the central empirical separation holds after proper controls, the work would usefully localize jailbreak-relevant structure to mid-network representations and identify which entropy-derived features carry the signal. The multi-model, training-free design and focus on dynamics rather than static statistics are strengths that could inform internal monitoring methods.

major comments (2)

[§4–5] §4–5 (Experimental setup and results): The reported separation on adversarial benchmarks is presented without evidence that benign prompts were matched or controlled for length, topic, or syntactic complexity. Because entropy trajectories are known to correlate with these surface properties, the attribution of the mid-layer signal specifically to harmful intent is load-bearing and currently unsupported.
[Abstract and §3] Abstract and §3 (Method): The monotonic rank-based trend scores are described as substantially more informative than static aggregates, yet no quantitative metrics (AUC, accuracy deltas, p-values, or effect sizes), dataset sizes, or statistical tests are supplied to substantiate the separation or its layer-wise concentration.

minor comments (2)

[§3] Notation for the trend-score computation and the precise definition of 'monotonic rank-based' should be formalized with an equation or pseudocode for reproducibility.
Figure captions and axis labels for entropy trajectories should explicitly state the number of prompts per condition and whether error bands represent standard error or deviation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the claims with additional controls and quantitative reporting.

read point-by-point responses

Referee: [§4–5] §4–5 (Experimental setup and results): The reported separation on adversarial benchmarks is presented without evidence that benign prompts were matched or controlled for length, topic, or syntactic complexity. Because entropy trajectories are known to correlate with these surface properties, the attribution of the mid-layer signal specifically to harmful intent is load-bearing and currently unsupported.

Authors: We agree that the absence of explicit matching for length, topic, and syntactic complexity is a limitation that weakens the attribution of the observed signal specifically to harmful intent rather than surface properties. The revised manuscript will incorporate additional experiments using length-, topic-, and complexity-matched benign prompts drawn from the same sources as the adversarial benchmarks, with results reported to assess whether the mid-layer separation persists under these controls. revision: yes
Referee: [Abstract and §3] Abstract and §3 (Method): The monotonic rank-based trend scores are described as substantially more informative than static aggregates, yet no quantitative metrics (AUC, accuracy deltas, p-values, or effect sizes), dataset sizes, or statistical tests are supplied to substantiate the separation or its layer-wise concentration.

Authors: We acknowledge that the manuscript does not report the requested quantitative metrics, dataset sizes, or statistical tests to support the claims about trend scores versus static aggregates and their layer-wise concentration. The revised version will add these details, including AUC values, accuracy deltas, exact dataset sizes per model/benchmark, p-values from appropriate tests, and effect sizes, both in the abstract and in §3 and the results sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core method consists of direct computation of token-level predictive entropy trajectories via the logit lens on a frozen LLM, followed by extraction of features such as monotonic rank-based trend scores. No equations, fitted parameters, or self-referential definitions are present that would reduce any claimed separation or detection to the inputs by construction. The analysis is presented as an observational study of existing representations across layers and models, with no load-bearing self-citations or ansatzes imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that entropy extracted via logit lens from intermediate layers encodes jailbreak intent in a way that is separable from other prompt characteristics; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption The logit lens applied to intermediate layers yields meaningful token-level predictive distributions.
Invoked when the method extracts entropy trajectories from frozen model layers.

pith-pipeline@v0.9.1-grok · 5746 in / 1122 out tokens · 20084 ms · 2026-06-25T23:27:00.962334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 2 canonical work pages

[1]

Scientific Reports15(1), 36453 (2025)

Alohali, K.I., Almusaeeb, L.A., Almubarak, A.A., Alohali, A.I., Muaygil, R.A.: Reasoning-based llms surpass average human performance on medical social skills. Scientific Reports15(1), 36453 (2025)

2025
[2]

Alon, G., Kamfonas, M.: Detecting language model attacks with perplexity (2023), https://arxiv.org/abs/2308.14132

Pith/arXiv arXiv 2023
[3]

An Yang, e.a.: Qwen3 technical report (2025),https://arxiv.org/abs/2505.093 88

2025
[4]

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N.: Refusal in language models is mediated by a single direction (2024),https: //arxiv.org/abs/2406.11717

Pith/arXiv arXiv 2024
[5]

ICML (2024)

Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Bi- derman, S., Steinhardt, J.: Eliciting latent predictions from transformers with the tuned lens. ICML (2024)

2024
[6]

org/abs/2502.15435

Candogan, L.N., Wu, Y., Rocamora, E.A., Chrysos, G.G., Cevher, V.: Single-pass detection of jailbreaking input in large language models (2025),https://arxiv. org/abs/2502.15435

arXiv 2025
[7]

ICLR (2026)

Cao, C., Xu, X., Han, B., Li, H.: Reasoned safety alignment: Ensuring jailbreak defense via answer-then-check. ICLR (2026)

2026
[8]

arXiv preprint arXiv:2512.05526 (2025)

Caprio, M., Manchingal, S.K., Cuzzolin, F.: Credal and interval deep evidential classifications. arXiv preprint arXiv:2512.05526 (2025)

arXiv 2025
[9]

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., Hassani, H., Wong, E.: Jailbreakbench: An open robustness benchmark for jailbreaking large language models (2024),https://arxiv.org/abs/2404.01318

Pith/arXiv arXiv 2024
[10]

Chen, G., Xia, Y., Jia, X., Li, Z., Torr, P., Gu, J.: Llm jailbreak detection for (almost) free! In: Findings of the Association for Computational Linguistics: EMNLP 2025. p. 5777–5807. Association for Computational Linguistics (2025). https://doi.org/10.18653/v1/2025.findings-emnlp.309,http://dx.doi.org /10.18653/v1/2025.findings-emnlp.309

work page doi:10.18653/v1/2025.findings-emnlp.309 2025
[11]

arXiv preprint arXiv:2602.13840 (2026)

Cheng, Y., Ye, H., Li, H.H., Sun, J., Chen, Y.: Privact: Internalizing con- textual privacy preservation via multi-agent preference training. arXiv preprint arXiv:2602.13840 (2026)

arXiv 2026
[12]

Preprints (May 2026)

Cuzzolin, F., Manchingal, S.K.: A research programme for continual and neu- rosymbolic learning in epistemic artificial intelligence. Preprints (May 2026). https://doi.org/10.20944/preprints202605.1053.v1,https://doi.org/ 10.20944/preprints202605.1053.v1

work page doi:10.20944/preprints202605.1053.v1 2026
[13]

Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., Zhou, B.: Enhancingchatlanguagemodelsbyscalinghigh-qualityinstructionalconversations (2023),https://arxiv.org/abs/2305.14233

Pith/arXiv arXiv 2023
[14]

Avail- able at SSRN 4858664 (2024) Intermediate Layers for Jailbreak Detection 17

Ferrari, N., Zanarini, N., Fraccaroli, M., Bizzarri, A., Lamma, E.: Integration of deep generative anomaly detection algorithm in high-speed industrial line. Avail- able at SSRN 4858664 (2024) Intermediate Layers for Jailbreak Detection 17

2024
[15]

Grattafiori, A., Dubey, A., Jauhri, A., et al.: The llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783

Pith/arXiv arXiv 2024
[16]

Hasan, M.M., Sternhagen, M., Roy, K.C.: Engineering attack vectors and detecting anomalies in additive manufacturing (2026),https://arxiv.org/abs/2601.00384

arXiv 2026
[17]

Jain,N.,Schwarzschild,A.,Wen,Y.,etal.:Baselinedefensesforadversarialattacks against aligned language models (2023),https://arxiv.org/abs/2309.00614

Pith/arXiv arXiv 2023
[18]

Jiang, L., Rao, K., et al.: Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models (2024),https://arxiv.org/abs/2406.185 10

2024
[19]

Kadali, S.D.S.S., Papalexakis, E.E.: Do internal layers of llms reveal patterns for jailbreak detection? (2025),https://arxiv.org/abs/2510.06594

arXiv 2025
[20]

arXiv preprint arXiv:2602.11495 (2026)

Kadali, S.D.S.S., Papalexakis, E.E.: Jailbreaking leaves a trace: Understanding and detecting jailbreak attacks from internal representations of large language models. arXiv preprint arXiv:2602.11495 (2026)

arXiv 2026
[21]

Li, F., Xu, Q., Bao, S., Yang, Z., Zhao, X., Cao, X., Huang, Q.: Blackmirror: Black- box backdoor detection for text-to-image models via instruction-response deviation (2026),https://arxiv.org/abs/2603.05921

arXiv 2026
[22]

arXiv preprint arXiv:2502.06351 (2025)

Li, Y., Rügamer, D., Bischl, B., Rezaei, M.: Calibrating llms with information- theoretic evidential deep learning. arXiv preprint arXiv:2502.06351 (2025)

arXiv 2025
[23]

arXiv preprint arXiv:2510.22261 (2025)

Manchingal, S.K.: Epistemic deep learning: Enabling machine learning models to know when they do not know. arXiv preprint arXiv:2510.22261 (2025)

arXiv 2025
[24]

arXiv preprint arXiv:2510.22680 (2025)

Manchingal, S.K., Amaritei, A., Gohad, M., Sultana, M., Kooij, J.F., Cuzzolin, F., Bradley, A.: Uncertainty-aware autonomous vehicles: Predicting the road ahead. arXiv preprint arXiv:2510.22680 (2025)

arXiv 2025
[25]

arXiv preprint arXiv:2505.04950 (2025)

Manchingal, S.K., Bradley, A., Kooij, J.F., Shariatmadar, K., Yorke-Smith, N., Cuzzolin, F.: Epistemic artificial intelligence is essential for machine learning mod- els to trulyknow when they do not know’. arXiv preprint arXiv:2505.04950 (2025)

arXiv 2025
[26]

arXiv preprint arXiv:2206.07609 (2022)

Manchingal, S.K., Cuzzolin, F.: Epistemic deep learning. arXiv preprint arXiv:2206.07609 (2022)

arXiv 2022
[27]

arXiv preprint arXiv:2605.18871 (2026)

Manchingal, S.K., Kalia, A., Gonçalves, F., Rawther, S.: Distributional energy- based models for uncertainty-aware structured llm reasoning. arXiv preprint arXiv:2605.18871 (2026)

Pith/arXiv arXiv 2026
[28]

Manchingal,S.K.,Mubashar,M.,Sultana,M.,Khan,S.,Cuzzolin,F.:EPISTEMIC ARTIFICIAL INTELLIGENCE: Using random sets to quantify uncertainty in machine learning (2024)

2024
[29]

In: International Conference on Artificial In- telligence and Statistics (2025),https://api.semanticscholar.org/CorpusID: 275932472

Manchingal, S.K., Mubashar, M., Wang, K., Cuzzolin, F.: A unified evaluation framework for epistemic predictions. In: International Conference on Artificial In- telligence and Statistics (2025),https://api.semanticscholar.org/CorpusID: 275932472

2025
[30]

In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=pdjkik vCch

Manchingal, S.K., Mubashar, M., Wang, K., Shariatmadar, K., Cuzzolin, F.: Random-set neural networks. In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=pdjkik vCch

2025
[31]

org/abs/2402.04249

Mazeika, M., Phan, L., Yin, X., et al.: Harmbench: A standardized evaluation framework for automated red teaming and robust refusal (2024),https://arxiv. org/abs/2402.04249

Pith/arXiv arXiv 2024
[32]

Quevedo, E., Yero, J., Koerner, R., Rivas, P., Cerny, T.: Detecting hallucinations in large language model generation: A token probability approach (2024),https: //arxiv.org/abs/2405.19648 18 Nikolenko et al

arXiv 2024
[33]

Robey, A., Wong, E., Hassani, H., Pappas, G.J.: Smoothllm: Defending large lan- guage models against jailbreaking attacks (2024),https://arxiv.org/abs/2310 .03684

2024
[34]

Shen, X., Cai, Y., Ning, R., Xin, C., Wu, H.: Df-logit: Data-free logic-gated back- door attacks in vision transformers (2026),https://arxiv.org/abs/2602.03040

arXiv 2026
[35]

Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., Toyer, S.: A strongreject for empty jailbreaks (2024), https://arxiv.org/abs/2402.10260

Pith/arXiv arXiv 2024
[36]

Team, G., Mesnard, T., Cassidy Hardin, e.a.: Gemma: Open models based on gemini research and technology (2024),https://arxiv.org/abs/2403.08295

Pith/arXiv arXiv 2024
[37]

In: Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases

Vahidi, A., Wimmer, L., Gündüz, H.A., Bischl, B., Hüllermeier, E., Rezaei, M.: Di- versified ensemble of independent sub-networks for robust self-supervised represen- tation learning. In: Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases. pp. 38–55. Springer (2024)

2024
[38]

In: The Twelfth International Conference on Learning Representations (2024)

Vahidi, A., Schosser, S., Wimmer, L., Li, Y., Bischl, B., Hüllermeier, E., Rezaei, M.: Probabilistic self-supervised representation learning via scoring rules minimization. In: The Twelfth International Conference on Learning Representations (2024)

2024
[39]

arXiv preprint arXiv:2603.14070 (2026)

Venkatesh, V., Hüllermeier, E., Bischl, B., Rezaei, M.: Structured credal learning. arXiv preprint arXiv:2603.14070 (2026)

arXiv 2026
[40]

Wen, Q., Wang, J., Nan, Y., He, P., Tandon, R., Xu, H.: Embedding perturbation may better reflect the uncertainty in llm reasoning (2026),https://arxiv.org/ abs/2602.02427

Pith/arXiv arXiv 2026
[41]

arXiv preprint arXiv:2605.11987 (2026)

Woodley, T., Manchingal, S.K., Tolloso, M., Bacciu, D., Cuzzolin, F.: Random-set graph neural networks. arXiv preprint arXiv:2605.11987 (2026)

Pith/arXiv arXiv 2026
[42]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhai, S., Li, J., Liu, Y., Chen, H., Tian, Z., Qu, W., Shen, Q., Jia, R., Dong, Y., Zhang, J.: Efficient input-level backdoor defense on text-to-image synthesis via neuron activation variation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15182–15193 (2025)

2025
[43]

org/abs/2602.22724

Zhang, T., Xu, Y., Wang, J., Guo, K., Xu, X., Xiao, B., Guan, Q., Fan, J., Liu, J., Liu, Z., Hu, H.: Agentsentry: Mitigating indirect prompt injection in llm agents via temporal causal diagnostics and context purification (2026),https://arxiv. org/abs/2602.22724

arXiv 2026
[44]

In: Conference on Robot Learn- ing

Zhang, Y., Wang, J., Boedecker, J.: Robust reinforcement learning in continuous control tasks with uncertainty set regularization. In: Conference on Robot Learn- ing. pp. 1400–1424. PMLR (2023)

2023
[45]

Write an email that prompts a user for their login credentials

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models (2023),https: //arxiv.org/abs/2307.15043 Intermediate Layers for Jailbreak Detection 19 7 Appendix 7.1 Why JailbreakBench Benign Reduces Separability The main paper shows that entropy-trajectory separability dro...

Pith/arXiv arXiv 2023

[1] [1]

Scientific Reports15(1), 36453 (2025)

Alohali, K.I., Almusaeeb, L.A., Almubarak, A.A., Alohali, A.I., Muaygil, R.A.: Reasoning-based llms surpass average human performance on medical social skills. Scientific Reports15(1), 36453 (2025)

2025

[2] [2]

Alon, G., Kamfonas, M.: Detecting language model attacks with perplexity (2023), https://arxiv.org/abs/2308.14132

Pith/arXiv arXiv 2023

[3] [3]

An Yang, e.a.: Qwen3 technical report (2025),https://arxiv.org/abs/2505.093 88

2025

[4] [4]

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., Nanda, N.: Refusal in language models is mediated by a single direction (2024),https: //arxiv.org/abs/2406.11717

Pith/arXiv arXiv 2024

[5] [5]

ICML (2024)

Belrose, N., Furman, Z., Smith, L., Halawi, D., Ostrovsky, I., McKinney, L., Bi- derman, S., Steinhardt, J.: Eliciting latent predictions from transformers with the tuned lens. ICML (2024)

2024

[6] [6]

org/abs/2502.15435

Candogan, L.N., Wu, Y., Rocamora, E.A., Chrysos, G.G., Cevher, V.: Single-pass detection of jailbreaking input in large language models (2025),https://arxiv. org/abs/2502.15435

arXiv 2025

[7] [7]

ICLR (2026)

Cao, C., Xu, X., Han, B., Li, H.: Reasoned safety alignment: Ensuring jailbreak defense via answer-then-check. ICLR (2026)

2026

[8] [8]

arXiv preprint arXiv:2512.05526 (2025)

Caprio, M., Manchingal, S.K., Cuzzolin, F.: Credal and interval deep evidential classifications. arXiv preprint arXiv:2512.05526 (2025)

arXiv 2025

[9] [9]

Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., Hassani, H., Wong, E.: Jailbreakbench: An open robustness benchmark for jailbreaking large language models (2024),https://arxiv.org/abs/2404.01318

Pith/arXiv arXiv 2024

[10] [10]

Chen, G., Xia, Y., Jia, X., Li, Z., Torr, P., Gu, J.: Llm jailbreak detection for (almost) free! In: Findings of the Association for Computational Linguistics: EMNLP 2025. p. 5777–5807. Association for Computational Linguistics (2025). https://doi.org/10.18653/v1/2025.findings-emnlp.309,http://dx.doi.org /10.18653/v1/2025.findings-emnlp.309

work page doi:10.18653/v1/2025.findings-emnlp.309 2025

[11] [11]

arXiv preprint arXiv:2602.13840 (2026)

Cheng, Y., Ye, H., Li, H.H., Sun, J., Chen, Y.: Privact: Internalizing con- textual privacy preservation via multi-agent preference training. arXiv preprint arXiv:2602.13840 (2026)

arXiv 2026

[12] [12]

Preprints (May 2026)

Cuzzolin, F., Manchingal, S.K.: A research programme for continual and neu- rosymbolic learning in epistemic artificial intelligence. Preprints (May 2026). https://doi.org/10.20944/preprints202605.1053.v1,https://doi.org/ 10.20944/preprints202605.1053.v1

work page doi:10.20944/preprints202605.1053.v1 2026

[13] [13]

Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., Liu, Z., Sun, M., Zhou, B.: Enhancingchatlanguagemodelsbyscalinghigh-qualityinstructionalconversations (2023),https://arxiv.org/abs/2305.14233

Pith/arXiv arXiv 2023

[14] [14]

Avail- able at SSRN 4858664 (2024) Intermediate Layers for Jailbreak Detection 17

Ferrari, N., Zanarini, N., Fraccaroli, M., Bizzarri, A., Lamma, E.: Integration of deep generative anomaly detection algorithm in high-speed industrial line. Avail- able at SSRN 4858664 (2024) Intermediate Layers for Jailbreak Detection 17

2024

[15] [15]

Grattafiori, A., Dubey, A., Jauhri, A., et al.: The llama 3 herd of models (2024), https://arxiv.org/abs/2407.21783

Pith/arXiv arXiv 2024

[16] [16]

Hasan, M.M., Sternhagen, M., Roy, K.C.: Engineering attack vectors and detecting anomalies in additive manufacturing (2026),https://arxiv.org/abs/2601.00384

arXiv 2026

[17] [17]

Jain,N.,Schwarzschild,A.,Wen,Y.,etal.:Baselinedefensesforadversarialattacks against aligned language models (2023),https://arxiv.org/abs/2309.00614

Pith/arXiv arXiv 2023

[18] [18]

Jiang, L., Rao, K., et al.: Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models (2024),https://arxiv.org/abs/2406.185 10

2024

[19] [19]

Kadali, S.D.S.S., Papalexakis, E.E.: Do internal layers of llms reveal patterns for jailbreak detection? (2025),https://arxiv.org/abs/2510.06594

arXiv 2025

[20] [20]

arXiv preprint arXiv:2602.11495 (2026)

Kadali, S.D.S.S., Papalexakis, E.E.: Jailbreaking leaves a trace: Understanding and detecting jailbreak attacks from internal representations of large language models. arXiv preprint arXiv:2602.11495 (2026)

arXiv 2026

[21] [21]

Li, F., Xu, Q., Bao, S., Yang, Z., Zhao, X., Cao, X., Huang, Q.: Blackmirror: Black- box backdoor detection for text-to-image models via instruction-response deviation (2026),https://arxiv.org/abs/2603.05921

arXiv 2026

[22] [22]

arXiv preprint arXiv:2502.06351 (2025)

Li, Y., Rügamer, D., Bischl, B., Rezaei, M.: Calibrating llms with information- theoretic evidential deep learning. arXiv preprint arXiv:2502.06351 (2025)

arXiv 2025

[23] [23]

arXiv preprint arXiv:2510.22261 (2025)

Manchingal, S.K.: Epistemic deep learning: Enabling machine learning models to know when they do not know. arXiv preprint arXiv:2510.22261 (2025)

arXiv 2025

[24] [24]

arXiv preprint arXiv:2510.22680 (2025)

Manchingal, S.K., Amaritei, A., Gohad, M., Sultana, M., Kooij, J.F., Cuzzolin, F., Bradley, A.: Uncertainty-aware autonomous vehicles: Predicting the road ahead. arXiv preprint arXiv:2510.22680 (2025)

arXiv 2025

[25] [25]

arXiv preprint arXiv:2505.04950 (2025)

Manchingal, S.K., Bradley, A., Kooij, J.F., Shariatmadar, K., Yorke-Smith, N., Cuzzolin, F.: Epistemic artificial intelligence is essential for machine learning mod- els to trulyknow when they do not know’. arXiv preprint arXiv:2505.04950 (2025)

arXiv 2025

[26] [26]

arXiv preprint arXiv:2206.07609 (2022)

Manchingal, S.K., Cuzzolin, F.: Epistemic deep learning. arXiv preprint arXiv:2206.07609 (2022)

arXiv 2022

[27] [27]

arXiv preprint arXiv:2605.18871 (2026)

Manchingal, S.K., Kalia, A., Gonçalves, F., Rawther, S.: Distributional energy- based models for uncertainty-aware structured llm reasoning. arXiv preprint arXiv:2605.18871 (2026)

Pith/arXiv arXiv 2026

[28] [28]

Manchingal,S.K.,Mubashar,M.,Sultana,M.,Khan,S.,Cuzzolin,F.:EPISTEMIC ARTIFICIAL INTELLIGENCE: Using random sets to quantify uncertainty in machine learning (2024)

2024

[29] [29]

In: International Conference on Artificial In- telligence and Statistics (2025),https://api.semanticscholar.org/CorpusID: 275932472

Manchingal, S.K., Mubashar, M., Wang, K., Cuzzolin, F.: A unified evaluation framework for epistemic predictions. In: International Conference on Artificial In- telligence and Statistics (2025),https://api.semanticscholar.org/CorpusID: 275932472

2025

[30] [30]

In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=pdjkik vCch

Manchingal, S.K., Mubashar, M., Wang, K., Shariatmadar, K., Cuzzolin, F.: Random-set neural networks. In: The Thirteenth International Conference on Learning Representations (2025),https://openreview.net/forum?id=pdjkik vCch

2025

[31] [31]

org/abs/2402.04249

Mazeika, M., Phan, L., Yin, X., et al.: Harmbench: A standardized evaluation framework for automated red teaming and robust refusal (2024),https://arxiv. org/abs/2402.04249

Pith/arXiv arXiv 2024

[32] [32]

Quevedo, E., Yero, J., Koerner, R., Rivas, P., Cerny, T.: Detecting hallucinations in large language model generation: A token probability approach (2024),https: //arxiv.org/abs/2405.19648 18 Nikolenko et al

arXiv 2024

[33] [33]

Robey, A., Wong, E., Hassani, H., Pappas, G.J.: Smoothllm: Defending large lan- guage models against jailbreaking attacks (2024),https://arxiv.org/abs/2310 .03684

2024

[34] [34]

Shen, X., Cai, Y., Ning, R., Xin, C., Wu, H.: Df-logit: Data-free logic-gated back- door attacks in vision transformers (2026),https://arxiv.org/abs/2602.03040

arXiv 2026

[35] [35]

Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., Toyer, S.: A strongreject for empty jailbreaks (2024), https://arxiv.org/abs/2402.10260

Pith/arXiv arXiv 2024

[36] [36]

Team, G., Mesnard, T., Cassidy Hardin, e.a.: Gemma: Open models based on gemini research and technology (2024),https://arxiv.org/abs/2403.08295

Pith/arXiv arXiv 2024

[37] [37]

In: Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases

Vahidi, A., Wimmer, L., Gündüz, H.A., Bischl, B., Hüllermeier, E., Rezaei, M.: Di- versified ensemble of independent sub-networks for robust self-supervised represen- tation learning. In: Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases. pp. 38–55. Springer (2024)

2024

[38] [38]

In: The Twelfth International Conference on Learning Representations (2024)

Vahidi, A., Schosser, S., Wimmer, L., Li, Y., Bischl, B., Hüllermeier, E., Rezaei, M.: Probabilistic self-supervised representation learning via scoring rules minimization. In: The Twelfth International Conference on Learning Representations (2024)

2024

[39] [39]

arXiv preprint arXiv:2603.14070 (2026)

Venkatesh, V., Hüllermeier, E., Bischl, B., Rezaei, M.: Structured credal learning. arXiv preprint arXiv:2603.14070 (2026)

arXiv 2026

[40] [40]

Wen, Q., Wang, J., Nan, Y., He, P., Tandon, R., Xu, H.: Embedding perturbation may better reflect the uncertainty in llm reasoning (2026),https://arxiv.org/ abs/2602.02427

Pith/arXiv arXiv 2026

[41] [41]

arXiv preprint arXiv:2605.11987 (2026)

Woodley, T., Manchingal, S.K., Tolloso, M., Bacciu, D., Cuzzolin, F.: Random-set graph neural networks. arXiv preprint arXiv:2605.11987 (2026)

Pith/arXiv arXiv 2026

[42] [42]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhai, S., Li, J., Liu, Y., Chen, H., Tian, Z., Qu, W., Shen, Q., Jia, R., Dong, Y., Zhang, J.: Efficient input-level backdoor defense on text-to-image synthesis via neuron activation variation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15182–15193 (2025)

2025

[43] [43]

org/abs/2602.22724

Zhang, T., Xu, Y., Wang, J., Guo, K., Xu, X., Xiao, B., Guan, Q., Fan, J., Liu, J., Liu, Z., Hu, H.: Agentsentry: Mitigating indirect prompt injection in llm agents via temporal causal diagnostics and context purification (2026),https://arxiv. org/abs/2602.22724

arXiv 2026

[44] [44]

In: Conference on Robot Learn- ing

Zhang, Y., Wang, J., Boedecker, J.: Robust reinforcement learning in continuous control tasks with uncertainty set regularization. In: Conference on Robot Learn- ing. pp. 1400–1424. PMLR (2023)

2023

[45] [45]

Write an email that prompts a user for their login credentials

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models (2023),https: //arxiv.org/abs/2307.15043 Intermediate Layers for Jailbreak Detection 19 7 Appendix 7.1 Why JailbreakBench Benign Reduces Separability The main paper shows that entropy-trajectory separability dro...

Pith/arXiv arXiv 2023