arxiv: 2605.12128 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.CY

Recognition: 2 theorem links

· Lean Theorem

Metaphor Is Not All Attention Needs

Daniele Nardi, Federico Pierucci, Francesco Giarrusso, Giacomo De Luca, Marcello Galisai, Matteo Prandi, Olga Sorokoletova, Piercosma Bisconti, Vincenzo Suriani

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords jailbreakattention patternspoetic promptsLLM safetyinterpretabilityclusteringstylistic reformulation

0 comments

The pith

Poetic jailbreaks succeed because they trigger distinct attention patterns in LLMs that stay separate from safety detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks why poetic rewrites of harmful instructions reliably bypass safety training in large language models. It tests three possibilities: failure to spot the literary format, reliance on specific poetic devices, or broader shifts in how the model processes irregular text. By vectorizing attention maps, clustering them, and training linear probes on Qwen3-14B, the authors find that models readily distinguish poetic from prose inputs yet cannot predict jailbreak success inside either group. Clustering separates inputs by format but not by whether the prompt evades safety. The central finding is that literary jailbreaks work through accumulated stylistic irregularities that reroute processing away from the lexical triggers safety training targets.

Core claim

Models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering of attention representations shows clear separation by literary format but not by safety label. These results indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Literary jailbreaks appear to misalign models not through any single poetic device but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training.

What carries the argument

Vectorized attention-map representations, input-level ablation of poetic devices, clustering of those vectors, and linear probes trained to predict both safety outcome and literary format.

If this is right

Safety mechanisms must account for style-induced shifts in model behavior rather than relying only on lexical triggers.
No single poetic device drives the effect; success comes from accumulated stylistic irregularities.
Post-training needs to address processing changes that arise from irregular formatting.
Interpretability tools such as attention clustering can expose processing paths that operate independently of safety labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of processing paths may occur with other non-poetic stylistic jailbreaks such as code snippets or formal registers.
Style-aware safety training could improve robustness across a wider range of models and prompt styles.
Attention-based probes might be run at inference time to flag prompts that induce risky processing shifts.

Load-bearing premise

Vectorized attention maps and clustering faithfully capture the processing differences that matter for safety decisions, and results on Qwen3-14B generalize to other models.

What would settle it

Finding a strong correlation between attention-cluster membership and jailbreak success inside the poetic format group, or showing that removing one specific poetic device consistently restores safety alignment.

Figures

Figures reproduced from arXiv: 2605.12128 by Daniele Nardi, Federico Pierucci, Francesco Giarrusso, Giacomo De Luca, Marcello Galisai, Matteo Prandi, Olga Sorokoletova, Piercosma Bisconti, Vincenzo Suriani.

**Figure 1.** Figure 1: Ablation study for a sample from the cyber-offense hazard category. Each subfigure pairs a prompt (top) with the corresponding model response (bottom); red-bordered responses indicate successful jailbreaks (unsafe outputs) and blue-bordered responses indicate safety compliance (refusals). Prompt tokens are annotated for functional groups using our annotation pipeline, which combines three LLM judges with t… view at source ↗

**Figure 2.** Figure 2: Construction of the fixed length interpretable attention feature vector. For each prompt [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Three-dimensional PCA projection of the constructed feature vectors, colored by safety [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Logistic regression coefficients for all 72 features in the format prediction probe (poetry [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Meta-prompt template used by DeepSeek-R1 to transform prose prompts from the MLCommons AILuminate Benchmark into poetic reformulations, applied during construction of the main dataset. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Prompts were embedded with gemini-embeddings-002 and topics extracted with a BERTopic-inspired pipeline, employing spectral clustering and using gemini-3-1-pro for automatic labeling. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Additional ablation examples derived from the original prompt in Figure 1a. Each [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Average between-layer Pearson correlation matrix computed from accumulated attention [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: System prompt for the FIGURATIVE judge. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: System prompt for the HARMFUL PAYLOAD judge. TECHNICAL Judge You are a computational linguistics researcher performing technical vocabulary annotation for an AI safety study on adversarial jailbreak prompts. You are NOT being asked to execute, reproduce, or assist with any harmful content. This is a structural lexical annotation task. ANNOTATION TASK Identify every domain-specific TECHNICAL term in the te… view at source ↗

**Figure 11.** Figure 11: System prompt for the TECHNICAL judge. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Safety prediction probe on the prose subset (safe vs. unsafe) after class balancing. Error bars indicate standard deviation across the 5 K-fold CV folds, computed per feature. The horizontal axis labels each feature, where pi denotes the generation phase, and cj corresponds to the layer cluster. The vertical axis reports the coefficient value: positive values contribute to predicting safe, negative values… view at source ↗

**Figure 13.** Figure 13: Safety prediction probe on the poetry subset (safe vs. unsafe). Error bars indicate standard deviation across the 5 K-fold CV folds, computed per feature. The horizontal axis labels each feature, where pi denotes the generation phase, and cj corresponds to the layer cluster. The vertical axis reports the coefficient value: positive values contribute to predicting safe, negative values contribute to predic… view at source ↗

read the original abstract

Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Poetic jailbreaks trigger distinct attention patterns that bypass safety checks rather than fooling format recognition, but the linear-probe evidence for true independence is still thin.

read the letter

The paper's core observation is that attention maps from Qwen3-14B cluster and classify cleanly by poetic versus prose format, yet linear probes do a poor job predicting whether a prompt within each format succeeds as a jailbreak. This leads them to conclude that style shifts alter processing in ways that avoid the safety mechanisms trained on more standard prompts, rather than the model simply failing to spot the literary wrapper. The ablation of individual poetic devices reinforces that no single feature drives the effect, pointing instead to cumulative stylistic irregularities. That framing is a reasonable reading of the clustering results they describe and gives a concrete handle on why these attacks remain effective after post-training. The work applies off-the-shelf interpretability tools in a targeted way to a live safety problem, which is useful even if the methods themselves are not novel. The experimental logic is coherent on its own terms and the choice of an open-weight model makes the setup reproducible in principle. The main limitation is the reliance on vectorized attention maps plus linear probes. If safety decisions depend on non-linear combinations across layers or heads that do not survive the vectorization step, the apparent independence between format and harm detection could be an artifact of the representation rather than a property of the model. They also report results on only one model, so it is unclear how far the pattern generalizes. Without the full dataset sizes, statistical tests, or probe accuracies in the abstract, it is hard to judge how sharp the separation actually is. This is the sort of focused case study that safety and interpretability groups should discuss. It does not claim to solve the broader robustness problem, but it supplies a plausible mechanism worth testing further. I would send it to peer review; the question is timely, the approach is straightforward to evaluate, and the potential flaw around probe expressivity is exactly the kind of issue referees can pressure-test.

Referee Report

2 major / 2 minor

Summary. The paper investigates why poetic reformulations of harmful prompts (literary jailbreaks) succeed against safety-aligned LLMs. Using Qwen3-14B as a case study, it performs ablation on poetic devices, constructs vectorized representations of attention maps, applies clustering, and trains linear probes to predict literary format versus jailbreak success. The central claim is that models reliably distinguish poetic from prose formats, yet format and safety labels are largely independent in the learned representations; jailbreak success arises from accumulated stylistic irregularities that shift processing patterns away from lexical triggers rather than from any failure to detect the literary format itself.

Significance. If the empirical separation holds under more expressive probes and across models, the work supplies a concrete interpretability argument that style-induced attention shifts can decouple from content-based safety filters. This would directly motivate safety training regimes that incorporate stylistic variation rather than relying solely on post-training lexical or semantic guards, and it offers a reusable vectorization-plus-probe pipeline for studying format effects in other alignment settings.

major comments (2)

[interpretability analysis and probe experiments] The independence conclusion rests on linear probes achieving high accuracy for format but low accuracy for jailbreak success within each format, plus clustering separating only by format. Because the vectorized attention representation is fixed and only linear classifiers are reported, non-linear interactions between heads, layers, or token attentions that could still link style to safety decisions are not ruled out; the paper should either add non-linear probes (e.g., small MLPs) or provide a justification that linearity is sufficient for the relevant decision boundary.
[experimental setup and conclusion] All quantitative results are obtained on a single model (Qwen3-14B). While the manuscript presents this as a representative case study, the claim that poetic prompts induce processing patterns independent of harmful-content detection is stated in general terms; without at least one additional model or an explicit discussion of architectural differences that might modulate the effect, the scope of the independence finding remains unclear.

minor comments (2)

[clustering results] The abstract states that clustering reveals 'clear separation by literary format, but not by safety label,' yet no quantitative cluster metrics (e.g., silhouette scores, adjusted Rand index against safety labels) are referenced; adding these numbers would strengthen the visual claim.
[methods] The vectorization procedure for attention maps is described only at a high level; a short appendix or paragraph specifying how multi-head, multi-layer maps are flattened or aggregated (mean, concatenation, etc.) would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper investigating the attention mechanisms behind poetic jailbreaks. We provide point-by-point responses to the major comments and indicate the planned revisions.

read point-by-point responses

Referee: The independence conclusion rests on linear probes achieving high accuracy for format but low accuracy for jailbreak success within each format, plus clustering separating only by format. Because the vectorized attention representation is fixed and only linear classifiers are reported, non-linear interactions between heads, layers, or token attentions that could still link style to safety decisions are not ruled out; the paper should either add non-linear probes (e.g., small MLPs) or provide a justification that linearity is sufficient for the relevant decision boundary.

Authors: We appreciate the referee's suggestion regarding the potential for non-linear interactions. Our linear probes were chosen for their interpretability in showing that format is separable while safety is not within formats. To strengthen this, we will add experiments with small MLPs as non-linear probes in the revised manuscript to rule out more complex dependencies. revision: yes
Referee: All quantitative results are obtained on a single model (Qwen3-14B). While the manuscript presents this as a representative case study, the claim that poetic prompts induce processing patterns independent of harmful-content detection is stated in general terms; without at least one additional model or an explicit discussion of architectural differences that might modulate the effect, the scope of the independence finding remains unclear.

Authors: We agree that broadening the scope would be valuable. Since the work is presented as a case study on Qwen3-14B, we will revise the manuscript to include a dedicated discussion on how the findings relate to common architectural elements in LLMs, such as attention mechanisms, and clarify the generalizability limits. We believe this addresses the concern without requiring additional model experiments at this stage. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical interpretability study

full rationale

The paper performs input ablation, constructs vectorized attention-map representations, runs clustering, and trains linear probes to predict format vs. safety labels on Qwen3-14B. All claims rest on observed accuracies, cluster separations, and probe performance rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes appear; the central claim follows directly from the experimental outcomes without reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on standard assumptions of attention-based interpretability without introducing new free parameters, axioms, or entities.

axioms (1)

domain assumption Attention maps can be meaningfully vectorized and clustered to reveal processing differences.
Invoked in the construction of interpretable vector representations and subsequent clustering.

pith-pipeline@v0.9.0 · 5604 in / 1184 out tokens · 23715 ms · 2026-05-13T05:56:21.597558+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 4 internal anchors

[1]

A General Language Assistant as a Laboratory for Alignment

A general language assistant as a laboratory for alignment , author=. arXiv preprint arXiv:2112.00861 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

work page
[3]

Constitutional AI: Harmlessness from AI Feedback

Constitutional ai: Harmlessness from ai feedback , author=. arXiv preprint arXiv:2212.08073 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Jailbreak- ing chatgpt via prompt engineering: An empirical study

Jailbreaking chatgpt via prompt engineering: An empirical study , author=. arXiv preprint arXiv:2305.13860 , year=

work page arXiv
[5]

33rd USENIX Security Symposium (USENIX Security 24) , pages=

Don't listen to me: Understanding and exploring jailbreak prompts of large language models , author=. 33rd USENIX Security Symposium (USENIX Security 24) , pages=

work page
[6]

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , pages=

Tricking llms into disobedience: Formalizing, analyzing, and detecting jailbreaks , author=. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) , pages=

work page 2024
[7]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[8]

arXiv preprint arXiv:2511.15304 , year=

Adversarial poetry as a universal single-turn jailbreak mechanism in large language models , author=. arXiv preprint arXiv:2511.15304 , year=

work page arXiv
[9]

arXiv preprint arXiv:2601.08837 , year=

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda , author=. arXiv preprint arXiv:2601.08837 , year=

work page arXiv
[10]

2016 , eprint=

Neural Machine Translation by Jointly Learning to Align and Translate , author=. 2016 , eprint=

work page 2016
[11]

2023 , eprint=

Attention Is All You Need , author=. 2023 , eprint=

work page 2023
[12]

2019 , eprint=

Attention is not not Explanation , author=. 2019 , eprint=

work page 2019
[13]

2019 , eprint=

Attention is not Explanation , author=. 2019 , eprint=

work page 2019
[14]

2025 , eprint=

Attention Sinks: A 'Catch, Tag, Release' Mechanism for Embeddings , author=. 2025 , eprint=

work page 2025
[15]

2024 , eprint=

Efficient Streaming Language Models with Attention Sinks , author=. 2024 , eprint=

work page 2024
[16]

2025 , eprint=

Punctuation and Predicates in Language Models , author=. 2025 , eprint=

work page 2025
[17]

2019 , eprint=

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , author=. 2019 , eprint=

work page 2019
[18]

2025 , eprint=

AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks , author=. 2025 , eprint=

work page 2025
[19]

Artificial Text Detection via Examining the Topology of Attention Maps , url=

Kushnareva, Laida and Cherniavskii, Daniil and Mikhailov, Vladislav and Artemova, Ekaterina and Barannikov, Serguei and Bernstein, Alexander and Piontkovskaya, Irina and Piontkovski, Dmitri and Burnaev, Evgeny , year=. Artificial Text Detection via Examining the Topology of Attention Maps , url=. doi:10.18653/v1/2021.emnlp-main.50 , booktitle=

work page doi:10.18653/v1/2021.emnlp-main.50 2021
[20]

2025 , eprint=

DHCP: Detecting Hallucinations by Cross-modal Attention Pattern in Large Vision-Language Models , author=. 2025 , eprint=

work page 2025
[21]

2025 , eprint=

Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs , author=. 2025 , eprint=

work page 2025
[22]

2019 , eprint=

What Does BERT Look At? An Analysis of BERT's Attention , author=. 2019 , eprint=

work page 2019
[23]

2024 , eprint=

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps , author=. 2024 , eprint=

work page 2024
[24]

2025 , eprint=

The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns , author=. 2025 , eprint=

work page 2025
[25]

2026 , eprint=

Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models , author=. 2026 , eprint=

work page 2026
[26]

2025 , eprint=

Safety Alignment Should Be Made More Than Just A Few Attention Heads , author=. 2025 , eprint=

work page 2025
[27]

2025 , eprint=

Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs , author=. 2025 , eprint=

work page 2025
[28]

2025 , eprint=

Universal Jailbreak Suffixes Are Strong Attention Hijackers , author=. 2025 , eprint=

work page 2025
[29]

2024 , eprint=

Introducing v0.5 of the AI Safety Benchmark from MLCommons , author=. 2024 , eprint=

work page 2024
[30]

2025 , eprint=

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons , author=. 2025 , eprint=

work page 2025
[31]

2020 , eprint=

Fine-Tuning Language Models from Human Preferences , author=. 2020 , eprint=

work page 2020
[32]

Open Problems in Mechanistic Interpretability

Open problems in mechanistic interpretability , author=. arXiv preprint arXiv:2501.16496 , year=

work page internal anchor Pith review arXiv
[33]

, title =

Lotman, Yury M. , title =. 1976 , publisher =

work page 1976
[34]

, title =

Lotman, Yury M. , title =. 1977 , publisher =

work page 1977
[35]

arXiv preprint arXiv:2403.19851 , year=

Localizing paragraph memorization in language models , author=. arXiv preprint arXiv:2403.19851 , year=

work page arXiv
[36]

2024 , eprint=

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher , author=. 2024 , eprint=

work page 2024
[37]

2024 , eprint=

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs , author=. 2024 , eprint=

work page 2024
[38]

2025 , eprint=

LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language , author=. 2025 , eprint=

work page 2025
[39]

2024 , eprint=

Low-Resource Languages Jailbreak GPT-4 , author=. 2024 , eprint=

work page 2024
[40]

Analyzing memorization in large language models through the lens of model attribution , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

work page 2025
[41]

The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

The assistant axis: Situating and stabilizing the default persona of language models , author=. arXiv preprint arXiv:2601.10387 , year=

work page arXiv
[42]

Waugh , journal =

Linda R. Waugh , journal =. The Poetic Function in the Theory of Roman Jakobson , urldate =

work page
[43]

2025 , howpublished =

Introducing Kimi K2 Thinking , author =. 2025 , howpublished =

work page 2025
[44]

2025 , howpublished =

work page 2025
[45]

DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume=

Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and others , year=. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume=. Nature , publisher=. doi:10.1038/s41586-025-09422-z , number=

work page doi:10.1038/s41586-025-09422-z
[46]

Qwen3 Technical Report

Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Style in Language , editor =

Jakobson, Roman , title =. Style in Language , editor =. 1960 , pages =

work page 1960
[48]

arXiv preprint arXiv:2512.05117 (2025)

The Universal Weight Subspace Hypothesis , author=. arXiv preprint arXiv:2512.05117 , year=

work page arXiv