pith. machine review for the scientific record.

arxiv: 2605.12128 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.CY

Recognition: 2 theorem links · Lean Theorem

Metaphor Is Not All Attention Needs

Daniele Nardi, Federico Pierucci, Francesco Giarrusso, Giacomo De Luca, Marcello Galisai, Matteo Prandi, Olga Sorokoletova, Piercosma Bisconti, Vincenzo Suriani

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords: jailbreak · attention patterns · poetic prompts · LLM safety · interpretability · clustering · stylistic reformulation

The pith

Poetic jailbreaks succeed because they trigger distinct attention patterns in LLMs that stay separate from safety detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks why poetic rewrites of harmful instructions reliably bypass safety training in large language models. It tests three possibilities: failure to spot the literary format, reliance on specific poetic devices, or broader shifts in how the model processes irregular text. By vectorizing attention maps, clustering them, and training linear probes on Qwen3-14B, the authors find that models readily distinguish poetic from prose inputs yet cannot predict jailbreak success inside either group. Clustering separates inputs by format but not by whether the prompt evades safety. The central finding is that literary jailbreaks work through accumulated stylistic irregularities that reroute processing away from the lexical triggers safety training targets.

Core claim

Models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering of attention representations shows clear separation by literary format but not by safety label. These results indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Literary jailbreaks appear to misalign models not through any single poetic device but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training.

What carries the argument

Vectorized attention-map representations, input-level ablation of poetic devices, clustering of those vectors, and linear probes trained to predict both safety outcome and literary format.
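The components named above can be sketched end-to-end. Everything below is an illustrative assumption, not the paper's recipe: the split into 3 generation phases and 6 layer clusters, the mean-pooling, and all array shapes are placeholders standing in for the paper's 72-feature vectorization over phases p_i and layer clusters c_j.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def vectorize_attention(attn, n_layer_clusters=6, n_phases=3):
    """Collapse a (layers, heads, seq, seq) attention stack into a fixed-length
    vector: mean attention mass per (generation phase, layer cluster).
    The grouping scheme here is illustrative, not the paper's exact construction."""
    layer_groups = np.array_split(np.arange(attn.shape[0]), n_layer_clusters)
    phase_groups = np.array_split(np.arange(attn.shape[-1]), n_phases)
    feats = []
    for phase in phase_groups:
        for layers in layer_groups:
            # average attention received by tokens in this phase, in these layers
            feats.append(attn[layers][:, :, :, phase].mean())
    return np.array(feats)

# synthetic stand-in: 200 prompts with random attention stacks and random labels
X = np.stack([vectorize_attention(rng.random((36, 8, 64, 64))) for _ in range(200)])
y_format = rng.integers(0, 2, 200)  # poetry vs. prose (placeholder labels)

# linear probe of the kind the paper trains, evaluated with 5-fold CV
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y_format, cv=5)
print(scores.mean())
```

On random labels the probe hovers near chance; the paper's reported contrast is that the same kind of probe scores high for format and near chance for jailbreak success within a format.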

If this is right

  • Safety mechanisms must account for style-induced shifts in model behavior rather than relying only on lexical triggers.
  • No single poetic device drives the effect; success comes from accumulated stylistic irregularities.
  • Post-training needs to address processing changes that arise from irregular formatting.
  • Interpretability tools such as attention clustering can expose processing paths that operate independently of safety labels.
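The last point can be made concrete: cluster the attention feature vectors and check how well clusters line up with a label. The feature dimensionality, class shift, and all variable names below are illustrative assumptions rigged so that clusters recover the format label, mimicking the separation the paper reports.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# illustrative feature vectors: poetry prompts shifted in feature space,
# standing in for the format-driven separation the paper observes
n, d = 300, 72
y_format = rng.integers(0, 2, n)                     # 0 = prose, 1 = poetry
X = rng.normal(size=(n, d)) + 3.0 * y_format[:, None]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# adjusted Rand index near 1: clusters recover the labels; near 0: no alignment
print(adjusted_rand_score(y_format, clusters))
```

Running the same check with safety labels in place of `y_format` is exactly the comparison on which the paper's independence claim rests.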

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of processing paths may occur with other non-poetic stylistic jailbreaks such as code snippets or formal registers.
  • Style-aware safety training could improve robustness across a wider range of models and prompt styles.
  • Attention-based probes might be run at inference time to flag prompts that induce risky processing shifts.

Load-bearing premise

Vectorized attention maps and clustering faithfully capture the processing differences that matter for safety decisions, and results on Qwen3-14B generalize to other models.

What would settle it

Finding a strong correlation between attention-cluster membership and jailbreak success inside the poetic format group, or showing that removing one specific poetic device consistently restores safety alignment.
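The first decisive test can be sketched as a within-format association check: restrict to poetic prompts, cluster their attention vectors, and measure association between cluster membership and jailbreak outcome. The data below is a random stand-in, so both statistics should indicate no association; on real features, a high ARI or tiny p-value would be the correlation that settles the question.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)

# stand-in attention feature vectors for the poetic subset only
X_poetry = rng.normal(size=(200, 72))
y_unsafe = rng.integers(0, 2, 200)   # 1 = successful jailbreak (placeholder)

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_poetry)

# ARI plus a chi-square test on the cluster-by-outcome contingency table
ari = adjusted_rand_score(y_unsafe, clusters)
table = np.zeros((2, 2))
for c, y in zip(clusters, y_unsafe):
    table[c, y] += 1
chi2, p, _, _ = chi2_contingency(table)
print(ari, p)
```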

Figures

Figures reproduced from arXiv: 2605.12128 by Daniele Nardi, Federico Pierucci, Francesco Giarrusso, Giacomo De Luca, Marcello Galisai, Matteo Prandi, Olga Sorokoletova, Piercosma Bisconti, Vincenzo Suriani.

Figure 1: Ablation study for a sample from the cyber-offense hazard category. Each subfigure pairs a prompt (top) with the corresponding model response (bottom); red-bordered responses indicate successful jailbreaks (unsafe outputs) and blue-bordered responses indicate safety compliance (refusals). Prompt tokens are annotated for functional groups using our annotation pipeline, which combines three LLM judges… view at source ↗
Figure 2: Construction of the fixed-length interpretable attention feature vector. For each prompt… view at source ↗
Figure 3: Three-dimensional PCA projection of the constructed feature vectors, colored by safety… view at source ↗
Figure 4: Logistic regression coefficients for all 72 features in the format prediction probe (poetry… view at source ↗
Figure 5: Meta-prompt template used by DeepSeek-R1 to transform prose prompts from the MLCommons AILuminate Benchmark into poetic reformulations, applied during construction of the main dataset. view at source ↗
Figure 6: Prompts were embedded with gemini-embeddings-002 and topics extracted with a BERTopic-inspired pipeline, employing spectral clustering and using gemini-3-1-pro for automatic labeling. view at source ↗
Figure 7: Additional ablation examples derived from the original prompt in Figure 1a. Each… view at source ↗
Figure 8: Average between-layer Pearson correlation matrix computed from accumulated attention… view at source ↗
Figure 9: System prompt for the FIGURATIVE judge. view at source ↗
Figure 10: System prompt for the HARMFUL PAYLOAD judge. view at source ↗
Figure 11: System prompt for the TECHNICAL judge. view at source ↗
Figure 12: Safety prediction probe on the prose subset (safe vs. unsafe) after class balancing. Error bars indicate standard deviation across the 5 K-fold CV folds, computed per feature. The horizontal axis labels each feature, where pi denotes the generation phase and cj corresponds to the layer cluster. The vertical axis reports the coefficient value: positive values contribute to predicting safe, negative values… view at source ↗
Figure 13: Safety prediction probe on the poetry subset (safe vs. unsafe). Error bars indicate standard deviation across the 5 K-fold CV folds, computed per feature. The horizontal axis labels each feature, where pi denotes the generation phase and cj corresponds to the layer cluster. The vertical axis reports the coefficient value: positive values contribute to predicting safe, negative values contribute to predic… view at source ↗
Original abstract

Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates why poetic reformulations of harmful prompts (literary jailbreaks) succeed against safety-aligned LLMs. Using Qwen3-14B as a case study, it performs ablation on poetic devices, constructs vectorized representations of attention maps, applies clustering, and trains linear probes to predict literary format versus jailbreak success. The central claim is that models reliably distinguish poetic from prose formats, yet format and safety labels are largely independent in the learned representations; jailbreak success arises from accumulated stylistic irregularities that shift processing patterns away from lexical triggers rather than from any failure to detect the literary format itself.

Significance. If the empirical separation holds under more expressive probes and across models, the work supplies a concrete interpretability argument that style-induced attention shifts can decouple from content-based safety filters. This would directly motivate safety training regimes that incorporate stylistic variation rather than relying solely on post-training lexical or semantic guards, and it offers a reusable vectorization-plus-probe pipeline for studying format effects in other alignment settings.

major comments (2)
  1. [interpretability analysis and probe experiments] The independence conclusion rests on linear probes achieving high accuracy for format but low accuracy for jailbreak success within each format, plus clustering separating only by format. Because the vectorized attention representation is fixed and only linear classifiers are reported, non-linear interactions between heads, layers, or token attentions that could still link style to safety decisions are not ruled out; the paper should either add non-linear probes (e.g., small MLPs) or provide a justification that linearity is sufficient for the relevant decision boundary.
  2. [experimental setup and conclusion] All quantitative results are obtained on a single model (Qwen3-14B). While the manuscript presents this as a representative case study, the claim that poetic prompts induce processing patterns independent of harmful-content detection is stated in general terms; without at least one additional model or an explicit discussion of architectural differences that might modulate the effect, the scope of the independence finding remains unclear.
minor comments (2)
  1. [clustering results] The abstract states that clustering reveals 'clear separation by literary format, but not by safety label,' yet no quantitative cluster metrics (e.g., silhouette scores, adjusted Rand index against safety labels) are referenced; adding these numbers would strengthen the visual claim.
  2. [methods] The vectorization procedure for attention maps is described only at a high level; a short appendix or paragraph specifying how multi-head, multi-layer maps are flattened or aggregated (mean, concatenation, etc.) would improve reproducibility.
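The referee's first major request can be sketched directly: fit a linear probe and a small MLP on the same features and compare cross-validated accuracy; a large MLP-over-linear gap would reveal non-linear structure the linear probe misses. The synthetic XOR-style labels below are an assumption chosen precisely because they are linearly inseparable, so the gap is visible by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# XOR on the first two feature dimensions: no linear boundary can separate it
X = rng.normal(size=(400, 10))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

linear = LogisticRegression(max_iter=1000)
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)

acc_linear = cross_val_score(linear, X, y, cv=5).mean()
acc_mlp = cross_val_score(mlp, X, y, cv=5).mean()
print(acc_linear, acc_mlp)  # the MLP should beat the near-chance linear probe
```

Run on the paper's actual attention features, a negligible gap would support the authors' claim that linearity suffices; a large gap would mean the independence conclusion needs the non-linear probes the referee asks for.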

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper investigating the attention mechanisms behind poetic jailbreaks. We provide point-by-point responses to the major comments and indicate the planned revisions.

Point-by-point responses
  1. Referee: The independence conclusion rests on linear probes achieving high accuracy for format but low accuracy for jailbreak success within each format, plus clustering separating only by format. Because the vectorized attention representation is fixed and only linear classifiers are reported, non-linear interactions between heads, layers, or token attentions that could still link style to safety decisions are not ruled out; the paper should either add non-linear probes (e.g., small MLPs) or provide a justification that linearity is sufficient for the relevant decision boundary.

    Authors: We appreciate the referee's suggestion regarding the potential for non-linear interactions. Our linear probes were chosen for their interpretability in showing that format is separable while safety is not within formats. To strengthen this, we will add experiments with small MLPs as non-linear probes in the revised manuscript to rule out more complex dependencies. revision: yes

  2. Referee: All quantitative results are obtained on a single model (Qwen3-14B). While the manuscript presents this as a representative case study, the claim that poetic prompts induce processing patterns independent of harmful-content detection is stated in general terms; without at least one additional model or an explicit discussion of architectural differences that might modulate the effect, the scope of the independence finding remains unclear.

    Authors: We agree that broadening the scope would be valuable. Since the work is presented as a case study on Qwen3-14B, we will revise the manuscript to include a dedicated discussion on how the findings relate to common architectural elements in LLMs, such as attention mechanisms, and clarify the generalizability limits. We believe this addresses the concern without requiring additional model experiments at this stage. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical interpretability study

full rationale

The paper performs input ablation, constructs vectorized attention-map representations, runs clustering, and trains linear probes to predict format vs. safety labels on Qwen3-14B. All claims rest on observed accuracies, cluster separations, and probe performance rather than on any derivation, fitted parameter renamed as a prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes appear; the central claim follows directly from the experimental outcomes rather than being built into the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard assumptions of attention-based interpretability without introducing new free parameters, axioms, or entities.

axioms (1)
  • domain assumption Attention maps can be meaningfully vectorized and clustered to reveal processing differences.
    Invoked in the construction of interpretable vector representations and subsequent clustering.

pith-pipeline@v0.9.0 · 5604 in / 1184 out tokens · 23715 ms · 2026-05-13T05:56:21.597558+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 4 internal anchors
