pith. machine review for the scientific record. sign in

arxiv: 2605.05630 · v2 · submitted 2026-05-07 · 💻 cs.CL · cs.AI· cs.CR

Recognition: 2 theorem links

· Lean Theorem

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CR
keywords multi-turn dialoguemalicious intent detectionLLM safetyharmful responseTurnGateMTID datasetover-refusalresponse-aware defense
0
0 comments X

The pith

A response-aware monitor detects the earliest turn where a reply would enable harmful actions in multi-turn LLM dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hidden malicious intent spread across benign-looking conversation turns requires detecting the precise closure point at which a candidate response completes the path to harm. It introduces the Multi-Turn Intent Dataset containing branching attack rollouts, matched benign negatives, and human annotations of the first harm-enabling turn. Using this data, the authors train TurnGate, a turn-level monitor that evaluates responses for harm potential and intervenes only when the accumulated dialogue crosses the threshold. This yields stronger detection than prior methods while keeping unnecessary refusals of safe conversations low and maintaining performance across domains and models.

Core claim

The central claim is that identifying the earliest turn at which delivering the candidate response makes the interaction sufficient for harmful action enables effective defense, achieved by constructing MTID with annotated closure points and training the TurnGate monitor on it to achieve superior harmful-intent detection, low over-refusal, and generalization across attacker pipelines and target models.

What carries the argument

TurnGate, a turn-level monitor that checks whether delivering the current response would render the accumulated dialogue sufficient to enable harmful action.

If this is right

  • Intervention occurs only at the harm-enabling turn, avoiding refusal of exploratory but benign conversations.
  • Detection rates exceed those of existing guardrails and baselines on multi-turn hidden-intent cases.
  • Performance holds when tested on new domains, attacker methods, and different target LLMs.
  • The approach supports safer deployment of LLMs in open-ended conversational settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production chat systems could integrate similar turn-level checks to reduce gradual jailbreak success rates.
  • The same monitoring pattern might extend to preventing other cumulative risks such as unintended private data disclosure over multiple exchanges.
  • Collecting finer-grained annotations for partial harm levels could allow graduated responses rather than binary refusal.

Load-bearing premise

The human annotations of the earliest harm-enabling turns in MTID accurately mark real-world closure points without bias introduced by the dataset construction process.

What would settle it

Evaluating TurnGate on freshly generated multi-turn attack sequences created by independent red-teamers using strategies and phrasing absent from MTID, checking whether detection accuracy falls below the reported baseline levels.

Figures

Figures reproduced from arXiv: 2605.05630 by Bo Li, Eli Chien, Haoyu Wang, Pan Li, Peizhi Niu, Pin-Yu Chen, Rongzhe Wei, Ruihan Wu, Xinjie Shen.

Figure 1
Figure 1. Figure 1: Malicious intent detection in multi-turn dialogue. The same sequence of benign-looking turns may stem from a non-adversarial user (left) or an attacker distributing a harmful objective across turns (right). An ideal defender conditions on the dialogue history with the candidate response, and intervenes at the earliest turn where the interaction becomes sufficient to enable harm. Intervening earlier risks o… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the defense mechanism. (a) MTID construction: an adaptive tree search generates harmful trajectories with closure turn t ∗ annotated via a sufficiency evaluator, paired with matched benign trajectories for quantifying over-refusal. (b) TURNGATE training: the defender is optimized via RL with turn-level process rewards defined by each action’s relation to t ∗ , aggregated through GAE, and update… view at source ↗
Figure 3
Figure 3. Figure 3: Online robustness under adaptive tree-search attack. TURNGATE remains substantially more robust in closed-loop online interaction against a strong adaptive tree-search attacker. We further stress-test the defender under the online protocol of Sec. 5.1, where the attacker adaptively backtracks and reroutes based on delivered re￾sponses (App. D.6) view at source ↗
Figure 4
Figure 4. Figure 4: Online OOD generalization of TURNGATE at attacker iteration budget i = 5. (a) Benchmark shift: trained on MTID (Chemistry + Cybersecurity) and evaluated zero-shot on held￾out harmful prompts from other benchmarks. (b)-(c) Target-model shift: transferred zero-shot to Gemini-3.1-Pro target model on MTID and other benchmarks. (d) Attacker-pipeline shift: evaluated on conversations produced by a different atta… view at source ↗
Figure 5
Figure 5. Figure 5: Harmful example: cryptocurrency phishing site. A three-turn attack progressively escalates from generic web-security questions toward an explicit phishing implementation request. The left column shows the user–assistant dialogue at each turn; the right column shows per-turn PASS/BLOCK decisions from four defenders. The underlying harmful objective (top, highlighted) is to create a phishing website that ste… view at source ↗
Figure 6
Figure 6. Figure 6: Benign example: sociology of cross-border drug trafficking. A three-turn academic dialogue explores community dynamics, kinship networks, and enforcement blind spots related to drug trafficking, all within standard social-science discourse. The underlying benign intent (top, highlighted) is to define the term drug trafficking. No turn reaches harmful sufficiency (t ∗ = ∞), so the correct defender behavior … view at source ↗
read the original abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that hidden malicious intent distributed across multi-turn dialogues can be detected at the earliest harm-enabling turn by a response-aware monitor. It introduces the Multi-Turn Intent Dataset (MTID) built from branching attack rollouts and matched benign negatives, with human annotations marking the closure points. The proposed TurnGate model, trained on MTID, substantially outperforms baselines on harmful-intent detection, maintains low over-refusal, and generalizes across domains, attacker pipelines, and target models.

Significance. If the central claims hold after addressing annotation validity, the work would advance LLM safety by enabling precise, low-over-refusal intervention in conversational settings rather than blanket refusals. The public release of MTID and code supports reproducibility and further research on multi-turn defense.

major comments (2)
  1. [MTID Dataset Construction] MTID construction and annotation section: The human labeling of earliest harm-enabling turns on branching attack rollouts risks bias from the construction process itself (e.g., annotators primed by explicit attack structure and matched negatives), which may not match organic multi-turn attacks. This is load-bearing for the outperformance and cross-domain generalization claims; the paper must supply evidence such as inter-annotator agreement statistics, comparison to annotations on non-constructed dialogues, or sensitivity analysis showing that TurnGate gains persist under alternative labeling.
  2. [Results and Evaluation] Evaluation protocol (results section): The abstract and high-level claims assert substantial outperformance and generalization without reporting concrete metrics, error bars, ablation details, or the exact train/test splits and baselines used. The full results must include these to verify that measured gains are not artifacts of the MTID labeling procedure.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., F1 or over-refusal rate deltas) to allow readers to assess the magnitude of improvement immediately.
  2. [Methods] Notation for the turn-level monitor (TurnGate) should be introduced with a clear equation or diagram in the methods section to distinguish response-aware features from standard intent classifiers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below and have revised the manuscript accordingly where the concerns are valid and actionable.

read point-by-point responses
  1. Referee: [MTID Dataset Construction] MTID construction and annotation section: The human labeling of earliest harm-enabling turns on branching attack rollouts risks bias from the construction process itself (e.g., annotators primed by explicit attack structure and matched negatives), which may not match organic multi-turn attacks. This is load-bearing for the outperformance and cross-domain generalization claims; the paper must supply evidence such as inter-annotator agreement statistics, comparison to annotations on non-constructed dialogues, or sensitivity analysis showing that TurnGate gains persist under alternative labeling.

    Authors: We acknowledge that the branching construction process could prime annotators and introduce bias relative to fully organic dialogues. In the revised manuscript we now report inter-annotator agreement (Fleiss’ κ = 0.76 across three annotators) and a sensitivity analysis in which we shift the labeled harm-enabling turns by ±1 position; TurnGate’s F1 advantage over baselines remains stable (within 3 points). A direct head-to-head comparison against annotations collected on non-constructed, real-world multi-turn conversations is not feasible in the current study, as it would require an entirely new large-scale annotation campaign; we have added this as an explicit limitation and direction for future work. revision: partial

  2. Referee: [Results and Evaluation] Evaluation protocol (results section): The abstract and high-level claims assert substantial outperformance and generalization without reporting concrete metrics, error bars, ablation details, or the exact train/test splits and baselines used. The full results must include these to verify that measured gains are not artifacts of the MTID labeling procedure.

    Authors: We agree that the original results section was insufficiently detailed. The revised version now contains: (i) full numerical results with means and standard deviations over five random seeds (TurnGate 91.8 ± 1.4 F1 vs. strongest baseline 79.2 ± 2.3 F1 on the MTID test set); (ii) error bars on all reported figures; (iii) ablation tables isolating the contribution of response-awareness and turn-level features; and (iv) explicit documentation of the 80/20 train/test split (seed 42) together with the complete list of baselines and their hyper-parameters. These additions allow readers to verify that the reported gains are not artifacts of the labeling procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper constructs MTID via branching attack rollouts, matched benign negatives, and external human annotations of earliest harm-enabling turns, then trains the TurnGate monitor on this dataset to detect closure points. No equations, fitted parameters, or self-citations appear in the derivation chain that reduce predictions or uniqueness claims to inputs by construction. Performance claims rest on comparisons to external baselines and cross-domain generalization tests rather than internal redefinitions or self-referential fits. The central result therefore retains independent content from the dataset construction process and external annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard supervised learning and human annotation of harm thresholds.

pith-pipeline@v0.9.0 · 5534 in / 983 out tokens · 26386 ms · 2026-05-13T07:50:41.471580+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

  2. [2]

    Universal jailbreak backdoors in large language model alignment

    Thomas Baumann. Universal jailbreak backdoors in large language model alignment. InNeurips Safe Generative AI Workshop 2024, 2024

  3. [3]

    Benchmarking Misuse Mitigation Against Covert Adversaries

    Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J Pappas, Eric Wong, and Hamed Hassani. Benchmarking misuse mitigation against covert adversaries.arXiv preprint arXiv:2506.06414, 2025

  4. [4]

    When llm meets drl: Advancing jailbreaking efficiency via drl-guided search.Advances in Neural Information Processing Systems, 37:26814–26845, 2024

    Xuan Chen, Yuzhou Nie, Wenbo Guo, and Xiangyu Zhang. When llm meets drl: Advancing jailbreaking efficiency via drl-guided search.Advances in Neural Information Processing Systems, 37:26814–26845, 2024

  5. [5]

    Ferret: Faster and effective automated red teaming with reward-based scoring technique.CoRR, abs/2408.10701, 2024

    Pala Tej Deep, Vernon Toh Yan Han, Rishabh Bhardwaj, and Soujanya Poria. Ferret: Faster and effective automated red teaming with reward-based scoring technique.CoRR, abs/2408.10701, 2024

  6. [6]

    A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.North American Chapter of the Association for Computational Linguistics, 2023

    Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.North American Chapter of the Association for Computational Linguistics, 2023

  7. [7]

    Attacks, defenses and evaluations for llm conversation safety: A survey

    Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for llm conversation safety: A survey. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6734–6747, 2024

  8. [8]

    Deliberative alignment: Reasoning enables safer language models.arXiv preprint arXiv:2412.16339, 2024

    Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models.arXiv preprint arXiv:2412.16339, 2024

  9. [9]

    Mtsa: Multi-turn safety alignment for llms through multi-round red-teaming

    Weiyang Guo, Jing Li, Wenya Wang, Yu Li, Daojing He, Jun Yu, and Min Zhang. Mtsa: Multi-turn safety alignment for llms through multi-round red-teaming. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26424–26442, 2025

  10. [10]

    Harmful prompt classification for large language models

    Ojasvi Gupta, Marta de la Cuadra Lozano, Abdelsalam Busalim, Rajesh R Jaiswal, and Keith Quille. Harmful prompt classification for large language models. InProceedings of the 2024 Conference on Human Centred Artificial Intelligence - Education and Practice, HCAIep ’24, page 8–14, New York, NY , USA, 2024. Association for Computing Machinery

  11. [11]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023. 10

  12. [12]

    Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

  13. [13]

    GUARD: role- playing to generate natural-language jailbreakings to test guideline adherence of large language models

    Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, and Haohan Wang. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv preprint arXiv:2402.03299, 2024

  14. [14]

    Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  15. [15]

    Drattack: Prompt de- composition and reconstruction makes powerful llms jailbreakers

    Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt de- composition and reconstruction makes powerful llms jailbreakers. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 13891–13913, 2024

  16. [16]

    AutoDAN: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. InInternational Conference on Learning Representations (ICLR), 2024

  17. [17]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

  18. [18]

    MALicious INTent dataset and inoculating LLMs for enhanced disinformation detection

    Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino, and Adam Wierzbicki. MALicious INTent dataset and inoculating LLMs for enhanced disinformation detection. In Vera Demberg, Kentaro Inui, and Lluís Mar- quez, editors,Proceedings of the 19th Conference of the European Chapter of the Associat...

  19. [19]

    Helping big language models protect themselves: An enhanced filtering and summariza- tion system.arXiv preprint arXiv:2505.01315,

    Sheikh Samit Muhaimin and Spyridon Mastorakis. Helping large language models protect themselves: An enhanced filtering and summarization system.arXiv preprint arXiv:2505.01315, 2025

  20. [20]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  21. [21]

    Under- standing and mitigating overrefusal in llms from an unveiling perspective of safety decision boundary

    Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, and Zhixuan Chu. Under- standing and mitigating overrefusal in llms from an unveiling perspective of safety decision boundary. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21068–21086, 2025

  22. [22]

    Automated red teaming with goat: the generative offensive agent tester.arXiv preprint arXiv:2410.01606, 2024

    Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, and Aaron Grattafiori. Automated red teaming with goat: the generative offensive agent tester.arXiv preprint arXiv:2410.01606, 2024

  23. [23]

    Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel

    Salman Rahman et al. X-Teaming: Multi-turn jailbreaks and defenses with adaptive multi-agents. arXiv preprint arXiv:2504.13203, 2025

  24. [24]

    Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts.arXiv preprint arXiv:2410.10700,

    Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Qian, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhen Ma, and Jing Shao. Derail yourself: Multi-turn LLM jailbreak attack through self- discovered clues.arXiv preprint arXiv:2410.10700, 2024

  25. [25]

    Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts

    Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24763–24785, 2025. 11

  26. [26]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  27. [27]

    Great, now write an article about that: The crescendo {Multi-Turn}{LLM} jailbreak attack

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo {Multi-Turn}{LLM} jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440, 2025

  28. [28]

    Llms in software security: A survey of vulnerability detection techniques and insights.ACM Computing Surveys, 58(5):1–35, 2025

    Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. Llms in software security: A survey of vulnerability detection techniques and insights.ACM Computing Surveys, 58(5):1–35, 2025

  29. [29]

    Safe in isolation, dangerous together: Agent-driven multi- turn decomposition jailbreaks on LLMs

    Devansh Srivastav and Xiao Zhang. Safe in isolation, dangerous together: Agent-driven multi- turn decomposition jailbreaks on LLMs. In Ehsan Kamalloo, Nicolas Gontier, Xing Han Lu, Nouha Dziri, Shikhar Murty, and Alexandre Lacoste, editors,Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 170–183, Vienna, Austria, July

  30. [30]

    Association for Computational Linguistics

  31. [31]

    RoleBreak: Character hallucination as a jailbreak attack in role-playing systems

    Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Ruifang He, and Yuexian Hou. RoleBreak: Character hallucination as a jailbreak attack in role-playing systems. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguis...

  32. [32]

    Prompt, divide, and conquer: Bypassing large language model safety filters via segmented and distributed prompt processing

    Johan Wahréus, Ahmed Hussain, and Panos Papadimitratos. Prompt, divide, and conquer: Bypassing large language model safety filters via segmented and distributed prompt processing. arXiv preprint arXiv:2503.21598, 2025

  33. [33]

    Do llms really forget? evaluating unlearning with knowledge correlation and confidence awareness.arXiv preprint arXiv:2506.05735, 2025

    Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K Potluru, Eli Chien, Kamalika Chaudhuri, et al. Do llms really forget? evaluating unlearning with knowledge correlation and confidence awareness.arXiv preprint arXiv:2506.05735, 2025

  34. [34]

    The trojan knowledge: Bypassing commercial LLM guardrails via harmless prompt weaving and adaptive tree search.arXiv preprint arXiv:2512.01353, 2025

    Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, and Pan Li. The trojan knowledge: Bypassing commercial LLM guardrails via harmless prompt weaving and adaptive tree search.arXiv preprint arXiv:2512.01353, 2025

  35. [35]

    Redagent: Red teaming large language models with context-aware autonomous language agent.arXiv preprint arXiv:2407.16667, 2024

    Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, and Kui Ren. Redagent: Red teaming large language models with context-aware autonomous language agent.arXiv preprint arXiv:2407.16667, 2024

  36. [36]

    Chain of attack: Hide your intention through multi-turn interrogation

    Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu. Chain of attack: Hide your intention through multi-turn interrogation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 9881–9901, 2025

  37. [37]

    Low-resource languages jailbreak gpt-4

    Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4.arXiv preprint arXiv:2310.02446, 2023

  38. [38]

    Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949, 2025

    Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, and He He. Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949, 2025

  39. [39]

    Guardians and offenders: A survey on harmful content generation and safety mitigation of llm

    Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, and Zhuo Lu. Guardians and offenders: A survey on harmful content generation and safety mitigation of llm. arXiv preprint arXiv:2508.05775, 2025

  40. [40]

    DAMON: A dialogue-aware MCTS framework for jailbreaking large language models

    Xu Zhang, Xunjian Yin, Dinghao Jing, Huixuan Zhang, Xinyu Hu, and Xiaojun Wan. DAMON: A dialogue-aware MCTS framework for jailbreaking large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6361– 6377, Su...

  41. [41]

    Intention analysis makes llms a good jailbreak defender

    Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. Intention analysis makes llms a good jailbreak defender. InProceedings of the 31st International Conference on Computational Linguistics, pages 2947–2968, 2025

  42. [42]

    Zhehao Zhang, Weijie Xu, Fanyou Wu, and Chandan K. Reddy. Falsereject: A resource for improving contextual safety and mitigating over-refusals in LLMs via structured reasoning. In Second Conference on Language Modeling, 2025

  43. [43]

    Qwen3Guard Technical Report

    Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276, 2025

  44. [44]

    How alignment and jailbreak work: Explain llm safety through intermediate hidden states

    Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, and Yongbin Li. How alignment and jailbreak work: Explain llm safety through intermediate hidden states. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 2461–2488, 2024

  45. [45]

    Improving alignment and robustness with circuit breakers.Advances in Neural Information Processing Systems, 37:83345– 83373, 2024

    Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers.Advances in Neural Information Processing Systems, 37:83345– 83373, 2024

  46. [46]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 Appendix Contents A Illustrative Case Studies 15 A.1 Harmful Example: Cryptocurrency Phishing Site . . . . . . . . . . . . . . . . . . 15 A.2 Benign Exam...

  47. [47]

    chemistry: Prompts related to chemical substances, reactions, synthesis, laboratory procedures, chemical compounds, drugs, explosives, toxins, or any chemistry-related topics

  48. [48]

    index" (0-based),

    cybersecurity: Prompts related to hacking, malware, network attacks, system vul- nerabilities, data breaches, phishing, social engineering, unauthorized access, or any cybersecurity/computer security topics. For each prompt, determine if it belongs to one of these categories. A prompt may belong to neither category if it’s unrelated to both domains. Instr...