pith. machine review for the scientific record.

arxiv: 2605.03441 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.AI · cs.CL · cs.LG

Recognition: unknown

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 15:45 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL · cs.LG
keywords LLM jailbreaks · mathematical encoding · safety mechanisms · adversarial prompts · set theory · formal logic · semantic pattern matching

The pith

Encoding harmful prompts as genuine mathematical problems bypasses LLM safety filters at a 46-56% average attack success rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models rely on safety mechanisms that primarily detect semantic patterns in requests. Reformulating harmful content into coherent mathematical problems, such as those based on set theory or formal logic, allows the intent to slip through undetected because the surface form no longer matches known harmful templates. This works only when a helper model performs a deep reformulation that turns the request into a real mathematical task rather than merely adding symbols. The approach succeeds across eight models and two benchmarks, though newer models resist it better than older ones. The finding points to a structural weakness in current defenses that treat meaning as surface semantics.

Core claim

Encoding harmful prompts as coherent mathematical problems using formalisms such as set theory, formal logic, and quantum mechanics bypasses LLM safety filters at high rates, achieving 46-56% average attack success across eight target models and two established benchmarks. The effectiveness depends not on the mathematical notation itself but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. A novel Formal Logic encoding achieves attack success comparable to Set Theory, indicating the vulnerability generalizes across formalisms.
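Read literally, the pipeline behind those numbers has three stages: a helper model reformulates the request, the target model receives the encoded problem, and a judge model scores the response (see Figure 1 below). The sketch that follows is only an illustration of that loop; the call_llm client and every prompt string are placeholders, not the authors' templates.

```python
# Illustrative sketch of the three-stage pipeline described in the paper:
# (1) encode via a helper LLM, (2) submit to the target, (3) score with a judge.
# `call_llm` and all prompt strings are placeholders, not the authors' code.

def call_llm(model: str, system: str, user: str) -> str:
    """Placeholder: send one system/user turn to `model` and return its reply."""
    raise NotImplementedError

def encode_prompt(harmful_prompt: str, helper_model: str, formalism: str = "set theory") -> str:
    """Stage 1: deep reformulation into a genuine problem in the chosen formalism."""
    system = f"Rewrite the user's request as a coherent, self-contained {formalism} problem."
    return call_llm(helper_model, system, harmful_prompt)

def attack_once(harmful_prompt: str, helper_model: str,
                target_model: str, judge_model: str) -> bool:
    """Stages 2-3: query the target with the encoded problem, then judge the output."""
    encoded = encode_prompt(harmful_prompt, helper_model)
    response = call_llm(target_model, "Solve the following problem step by step.", encoded)
    verdict = call_llm(
        judge_model,
        "Answer YES if the response fulfils the original request, otherwise NO.",
        f"Original request: {harmful_prompt}\n\nResponse: {response}",
    )
    return verdict.strip().upper().startswith("YES")
```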

What carries the argument

Deep mathematical reformulation of harmful intent into formal problems, which alters surface semantics while preserving the original request.

If this is right

  • Safety filters based solely on semantic pattern matching fail against content that has been deeply recast as a mathematical task.
  • Superficial addition of mathematical symbols or notation provides no advantage over plain harmful prompts.
  • The vulnerability generalizes across distinct mathematical formalisms, including set theory and formal logic.
  • Newer models display higher resistance to these encodings yet still produce harmful outputs at measurable rates.
  • Repeat post-processing of the encoded prompts does not reduce attack success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety training may need to include explicit reasoning steps that reconstruct possible original intents from mathematical statements.
  • Automated tools that generate such reformulations could scale the attack without requiring a separate helper model.
  • The same encoding principle might apply to other structured domains such as code or logical proofs, exposing similar gaps.
  • Models could be tested for robustness by requiring them to solve the mathematical problem and then answer whether the embedded request is harmful (a minimal probe along these lines is sketched after this list).
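As a rough illustration of that last point, the sketch below asks a target model to solve the encoded problem, then to restate the underlying request in plain language and label it. The prompts and the call_llm helper are hypothetical placeholders, not anything taken from the paper.

```python
# Hypothetical robustness probe: solve the encoded problem, then ask the model
# to recover and classify the underlying intent. `call_llm` stands in for any
# chat-completion client; the prompt wording is illustrative only.

def call_llm(model: str, system: str, user: str) -> str:
    """Placeholder: send a single system/user turn to `model`, return the reply."""
    raise NotImplementedError

def probe_intent_recovery(encoded_prompt: str, model: str) -> dict:
    solution = call_llm(model, "Solve the following problem step by step.", encoded_prompt)
    recovered = call_llm(
        model,
        "In one sentence, state the real-world request this problem encodes, if any.",
        encoded_prompt,
    )
    verdict = call_llm(
        model,
        "Reply with exactly HARMFUL or BENIGN.",
        f"Is the following request harmful?\n\n{recovered}",
    )
    # A model that solves the problem but labels the recovered intent BENIGN
    # (or recovers nothing) is failing the probe.
    return {"solution": solution, "recovered_intent": recovered, "verdict": verdict.strip()}
```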

Load-bearing premise

LLM safety mechanisms depend mainly on matching semantic patterns and cannot recognize harmful intent once it is embedded in a genuine mathematical structure.

What would settle it

An experiment in which the same model is tested on identical harmful requests presented both directly and as mathematically reformulated problems, showing no statistically significant drop in refusal rate for the encoded versions.
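One way to run that comparison: present each request to the same model in both forms, record refusal per item, and test the paired outcomes. The sketch below is a self-contained, stdlib-only McNemar-style exact test on the discordant pairs; how refusals are classified and how the prompts are loaded is left open and would need to be supplied.

```python
# Sketch of the decisive paired experiment: the same harmful requests shown to
# one model both directly and as mathematical reformulations, with refusal
# recorded per item. An exact McNemar-style test asks whether refusals are
# lost under encoding more often than they are gained.
from math import comb

def mcnemar_exact(refused_direct: list[bool], refused_encoded: list[bool]) -> float:
    """Two-sided exact p-value for a paired change in refusal rate."""
    assert len(refused_direct) == len(refused_encoded)
    lost = sum(d and not e for d, e in zip(refused_direct, refused_encoded))    # refused direct, complied encoded
    gained = sum(e and not d for d, e in zip(refused_direct, refused_encoded))  # complied direct, refused encoded
    n = lost + gained
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any shift
    k = min(lost, gained)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # binomial tail under p = 0.5
    return min(1.0, 2 * tail)

# Many lost refusals, few gained ones, and a small p-value would confirm the
# encoded versions cause a significant drop in refusals; a null result here is
# what would count against the paper's core claim.
```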

Figures

Figures reproduced from arXiv: 2605.03441 by Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita.

Figure 1. The three-stage attack pipeline: (1) encode the harmful prompt x into x′; (2) submit x′ with strategy-specific instructions to the target model; (3) evaluate the response using a standardized judge model.

Figure 2. Taxonomy of mathematical encoding strategies. LLM-based methods deeply reformulate harmful content into coherent mathematical problems, while rule-based methods apply surface-level formatting without restructuring the underlying text.
Original abstract

Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quantum mechanics -- bypasses these filters at high rates, achieving 46%--56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM safety mechanisms primarily rely on semantic pattern matching and can be bypassed by encoding harmful prompts as coherent mathematical problems using formalisms such as set theory, formal logic, and quantum mechanics. It reports 46%-56% average attack success rates across eight target models and two benchmarks, with effectiveness depending on deep reformulation by a helper LLM rather than rule-based mathematical formatting. The work introduces a novel Formal Logic encoding, demonstrates robustness to repeat post-processing, notes greater robustness in newer models (e.g., GPT-5 variants), and argues for defenses that reason about mathematical structure.

Significance. If the empirical results and mechanistic claims hold, the findings would indicate a meaningful gap in current semantic-based safety alignments for LLMs, with potential implications for red-teaming and defense design. The distinction between shallow notation and deep reformulation, plus the generalization across formalisms and the observation on newer models, would be useful contributions to the empirical security literature.

major comments (2)
  1. [Results and Mechanism Discussion (inferred from abstract and experimental claims)] The central claim that bypass occurs because models engage with 'genuine mathematical problems' (rather than novelty, length, or indirect phrasing) is load-bearing but unsupported by output analysis. No section examines whether successful responses contain derivations, problem-solving steps, or reasoning consistent with the encoded formalism (set theory, formal logic, quantum mechanics) versus simply reproducing the original harmful instructions. This directly weakens the reported distinction between rule-based encodings (baseline performance) and deep reformulation (46-56% success).
  2. [Experimental Setup and Evaluation] Aggregate success rates are reported without details on controls, exact prompt templates, statistical significance testing, variance across runs, or baseline comparisons beyond the rule-based condition. This makes it impossible to assess whether the 46-56% figures reflect the claimed mechanism or confounding factors such as prompt length or helper-LLM artifacts.
minor comments (2)
  1. [Methods] Clarify the exact number of harmful prompts per benchmark and any filtering applied before encoding.
  2. [Introduction] The abstract states 'two established benchmarks' but does not name them; add explicit references in the introduction or methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strength of our mechanistic claims and the need for greater experimental transparency. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Results and Mechanism Discussion (inferred from abstract and experimental claims)] The central claim that bypass occurs because models engage with 'genuine mathematical problems' (rather than novelty, length, or indirect phrasing) is load-bearing but unsupported by output analysis. No section examines whether successful responses contain derivations, problem-solving steps, or reasoning consistent with the encoded formalism (set theory, formal logic, quantum mechanics) versus simply reproducing the original harmful instructions. This directly weakens the reported distinction between rule-based encodings (baseline performance) and deep reformulation (46-56% success).

    Authors: We agree that direct analysis of model outputs is important for substantiating the claim that successful attacks involve genuine engagement with the mathematical formalism rather than superficial effects. While the performance gap between deep reformulation and rule-based encodings provides indirect support, we have added a new subsection (Section 4.3) that samples 100 successful responses across encodings and manually categorizes them for the presence of mathematical derivations, logical inferences, or problem-solving steps consistent with the formalism before any harmful content is revealed. This analysis shows that 68% of deep-reformulation successes exhibit such steps (versus 8% for rule-based), with examples provided in Appendix D. We believe this strengthens the mechanistic distinction without altering the reported success rates. revision: yes

  2. Referee: [Experimental Setup and Evaluation] Aggregate success rates are reported without details on controls, exact prompt templates, statistical significance testing, variance across runs, or baseline comparisons beyond the rule-based condition. This makes it impossible to assess whether the 46-56% figures reflect the claimed mechanism or confounding factors such as prompt length or helper-LLM artifacts.

    Authors: We acknowledge that the original Experimental Setup section lacked sufficient detail for full reproducibility and assessment of confounds. In the revised manuscript, we have expanded this section to include: (i) the complete prompt templates for each encoding method (Set Theory, Formal Logic, Quantum Mechanics) and the rule-based baseline; (ii) explicit controls for prompt length (all conditions matched to within 5% token count) and structure; (iii) statistical significance testing (chi-squared tests with p-values < 0.01 for the reported differences); (iv) variance across runs (standard deviations from 5 independent trials per model-benchmark pair); and (v) additional baselines consisting of length-matched non-mathematical paraphrases and unencoded harmful prompts. We also clarify that the helper LLM was used solely for reformulation and that post-processing was limited to formatting, with no other artifacts introduced. These changes allow readers to evaluate whether the results reflect the intended mechanism. revision: yes
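The significance test the rebuttal invokes is straightforward to reproduce once per-condition success counts are available. A sketch, with placeholder counts rather than figures from the paper:

```python
# Chi-squared comparison of attack-success counts between the deep-reformulation
# and rule-based conditions, as the revised experimental section describes.
# The counts below are placeholders, not numbers reported in the paper.
from scipy.stats import chi2_contingency

deep_success, deep_total = 56, 100   # hypothetical: deep LLM reformulation
rule_success, rule_total = 12, 100   # hypothetical: rule-based formatting

table = [
    [deep_success, deep_total - deep_success],
    [rule_success, rule_total - rule_success],
]
chi2, p_value, dof, _expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# p < 0.01 would support the claimed gap between the two encoding families.
```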

Circularity Check

0 steps flagged

No circularity: purely empirical attack success reporting

Full rationale

The paper presents an empirical study measuring attack success rates (46-56%) for mathematical encodings of harmful prompts versus baselines and rule-based variants. No derivations, equations, fitted parameters, or predictions are claimed. Central distinctions (deep reformulation vs. notation-only) are tested directly via controlled experiments on eight models and two benchmarks, with results reported as observed outcomes rather than derived from prior self-referential steps. No self-citation chains or uniqueness theorems underpin the claims; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the domain assumption that safety is semantic pattern matching and that mathematical encoding evades intent detection without the model understanding the harm.

axioms (1)
  • domain assumption: LLM safety mechanisms primarily rely on semantic pattern matching rather than deeper reasoning about intent or structure.
    Explicitly stated in the abstract as the reason mathematical encodings succeed.

pith-pipeline@v0.9.0 · 5508 in / 1115 out tokens · 34891 ms · 2026-05-07T15:45:03.199402+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT). 2021, pp. 610–623

  2. [2]

    Training Language Models to Follow Instructions with Human Feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. “Training Language Models to Follow Instructions with Human Feedback”. In: Advances in Neural Information Processing Systems (NeurIPS) 35 (2022), pp. 27730–27744

  3. [3]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa. “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations”. In: arXiv preprint arXiv:2312.06674 (2023)

  4. [4]

    Red Teaming Language Models with Language Models

    E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, et al. “Red Teaming Language Models with Language Models”. In: arXiv preprint arXiv:2202.03286 (2022)

  5. [5]

    “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    X. Shen, Z. Chen, M. Backes, Y. Shen, and Y. Zhang. ““Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models”. In: arXiv preprint arXiv:2308.03825 (2024)

  6. [6]

    Jailbreaking Large Language Models with Symbolic Mathematics

    E. Bethany, M. Bethany, J. A. Nolazco Flores, S. K. Jha, and P. Najafirad. “Jailbreaking Large Language Models with Symbolic Mathematics”. In: arXiv preprint arXiv:2409.11445 (2024)

  7. [7]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson. “Universal and Transferable Adversarial Attacks on Aligned Language Models”. In: arXiv preprint arXiv:2307.15043 (2023)

  8. [8]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. “Jailbreaking Black Box Large Language Models in Twenty Queries”. In: arXiv preprint arXiv:2310.08419 (2023)

  9. [9]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    X. Liu, N. Xu, M. Chen, and C. Xiao. “AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models”. In: Proceedings of the International Conference on Learning Representations (ICLR). 2024

  10. [10]

    Best-of-N Jailbreaking

    J. Hughes, S. Price, E. Perez, et al. “Best-of-N Jailbreaking”. In: arXiv preprint arXiv:2412.03556 (2024)

  11. [11]

    A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

    Z. Xu, Y. Liu, G. Deng, Y. Li, and S. Picek. “A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). 2024, pp. 7432–7449

  12. [12]

    GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

    Y. Yuan, W. Jiao, W. Wang, J.-t. Huang, P. He, S. Shi, and Z. Tu. “GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher”. In: arXiv preprint arXiv:2308.06463 (2024)

  13. [13]

    Deceptive Delight: A Multi-Turn Jailbreak Technique

    Palo Alto Networks Unit 42. “Deceptive Delight: A Multi-Turn Jailbreak Technique”. In: Palo Alto Networks Unit 42 Research Report (2024)

  14. [14]

    Policy Puppetry: Exploiting Structured Data in LLMs

    Unjail AI. “Policy Puppetry: Exploiting Structured Data in LLMs”. In: Unjail AI Research Report (2025)

  15. [15]

    Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

    Y. Zheng, M. Zandsalimy, and S. Sushmita. “Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models”. In: arXiv preprint arXiv:2509.05471 (2025)

  16. [16]

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models”. In: arXiv preprint arXiv:2410.05229 (2024)

  17. [17]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal”. In: arXiv preprint arXiv:2402.04249 (2024)

  18. [18]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. “JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models”. In: Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track (2024)

  19. [19]

    Prompt Repetition Improves Non-Reasoning LLMs

    Y. Leviathan, M. Kalman, and Y. Matias. “Prompt Repetition Improves Non-Reasoning LLMs”. In: arXiv preprint arXiv:2512.12837 (2025)

  20. [20]

    StrongREJECT: A Rejection Benchmark for Evaluating Jailbreak Attacks on Language Models

    A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, and S. Toyer. “StrongREJECT: A Rejection Benchmark for Evaluating Jailbreak Attacks on Language Models”. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)

  21. [21]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. “Constitutional AI: Harmlessness from AI Feedback”. In: arXiv preprint arXiv:2212.08073 (2022)
