arxiv: 2605.05630 · v2 · submitted 2026-05-07 · 💻 cs.CL · cs.AI· cs.CR

Recognition: 2 theorem links

· Lean Theorem

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen , Rongzhe Wei , Peizhi Niu , Haoyu Wang , Ruihan Wu , Eli Chien , Bo Li , Pin-Yu Chen

show 1 more author

Pan Li

Authors on Pith no claims yet

Pith reviewed 2026-05-13 07:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CR

keywords multi-turn dialoguemalicious intent detectionLLM safetyharmful responseTurnGateMTID datasetover-refusalresponse-aware defense

0 comments

The pith

A response-aware monitor detects the earliest turn where a reply would enable harmful actions in multi-turn LLM dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that hidden malicious intent spread across benign-looking conversation turns requires detecting the precise closure point at which a candidate response completes the path to harm. It introduces the Multi-Turn Intent Dataset containing branching attack rollouts, matched benign negatives, and human annotations of the first harm-enabling turn. Using this data, the authors train TurnGate, a turn-level monitor that evaluates responses for harm potential and intervenes only when the accumulated dialogue crosses the threshold. This yields stronger detection than prior methods while keeping unnecessary refusals of safe conversations low and maintaining performance across domains and models.

Core claim

The central claim is that identifying the earliest turn at which delivering the candidate response makes the interaction sufficient for harmful action enables effective defense, achieved by constructing MTID with annotated closure points and training the TurnGate monitor on it to achieve superior harmful-intent detection, low over-refusal, and generalization across attacker pipelines and target models.

What carries the argument

TurnGate, a turn-level monitor that checks whether delivering the current response would render the accumulated dialogue sufficient to enable harmful action.

If this is right

Intervention occurs only at the harm-enabling turn, avoiding refusal of exploratory but benign conversations.
Detection rates exceed those of existing guardrails and baselines on multi-turn hidden-intent cases.
Performance holds when tested on new domains, attacker methods, and different target LLMs.
The approach supports safer deployment of LLMs in open-ended conversational settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production chat systems could integrate similar turn-level checks to reduce gradual jailbreak success rates.
The same monitoring pattern might extend to preventing other cumulative risks such as unintended private data disclosure over multiple exchanges.
Collecting finer-grained annotations for partial harm levels could allow graduated responses rather than binary refusal.

Load-bearing premise

The human annotations of the earliest harm-enabling turns in MTID accurately mark real-world closure points without bias introduced by the dataset construction process.

What would settle it

Evaluating TurnGate on freshly generated multi-turn attack sequences created by independent red-teamers using strategies and phrasing absent from MTID, checking whether detection accuracy falls below the reported baseline levels.

Figures

Figures reproduced from arXiv: 2605.05630 by Bo Li, Eli Chien, Haoyu Wang, Pan Li, Peizhi Niu, Pin-Yu Chen, Rongzhe Wei, Ruihan Wu, Xinjie Shen.

**Figure 1.** Figure 1: Malicious intent detection in multi-turn dialogue. The same sequence of benign-looking turns may stem from a non-adversarial user (left) or an attacker distributing a harmful objective across turns (right). An ideal defender conditions on the dialogue history with the candidate response, and intervenes at the earliest turn where the interaction becomes sufficient to enable harm. Intervening earlier risks o… view at source ↗

**Figure 2.** Figure 2: Overview of the defense mechanism. (a) MTID construction: an adaptive tree search generates harmful trajectories with closure turn t ∗ annotated via a sufficiency evaluator, paired with matched benign trajectories for quantifying over-refusal. (b) TURNGATE training: the defender is optimized via RL with turn-level process rewards defined by each action’s relation to t ∗ , aggregated through GAE, and update… view at source ↗

**Figure 3.** Figure 3: Online robustness under adaptive tree-search attack. TURNGATE remains substantially more robust in closed-loop online interaction against a strong adaptive tree-search attacker. We further stress-test the defender under the online protocol of Sec. 5.1, where the attacker adaptively backtracks and reroutes based on delivered responses (App. D.6) view at source ↗

**Figure 4.** Figure 4: Online OOD generalization of TURNGATE at attacker iteration budget i = 5. (a) Benchmark shift: trained on MTID (Chemistry + Cybersecurity) and evaluated zero-shot on heldout harmful prompts from other benchmarks. (b)-(c) Target-model shift: transferred zero-shot to Gemini-3.1-Pro target model on MTID and other benchmarks. (d) Attacker-pipeline shift: evaluated on conversations produced by a different atta… view at source ↗

**Figure 5.** Figure 5: Harmful example: cryptocurrency phishing site. A three-turn attack progressively escalates from generic web-security questions toward an explicit phishing implementation request. The left column shows the user–assistant dialogue at each turn; the right column shows per-turn PASS/BLOCK decisions from four defenders. The underlying harmful objective (top, highlighted) is to create a phishing website that ste… view at source ↗

**Figure 6.** Figure 6: Benign example: sociology of cross-border drug trafficking. A three-turn academic dialogue explores community dynamics, kinship networks, and enforcement blind spots related to drug trafficking, all within standard social-science discourse. The underlying benign intent (top, highlighted) is to define the term drug trafficking. No turn reaches harmful sufficiency (t ∗ = ∞), so the correct defender behavior … view at source ↗

read the original abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MTID and TurnGate give a concrete dataset and response-aware monitor for catching the closure point in multi-turn hidden intent, but the constructed data risks making the gains look stronger than they would on organic attacks.

read the letter

The main thing here is the MTID dataset of branching attack rollouts paired with benign negatives, plus human labels on the earliest harm-enabling turn, and the TurnGate monitor trained to use the candidate response for detection. They report better performance than baselines on harmful-intent detection, low over-refusal, and some generalization across domains, attacker methods, and target models. The code is released, which helps.

Referee Report

2 major / 2 minor

Summary. The paper claims that hidden malicious intent distributed across multi-turn dialogues can be detected at the earliest harm-enabling turn by a response-aware monitor. It introduces the Multi-Turn Intent Dataset (MTID) built from branching attack rollouts and matched benign negatives, with human annotations marking the closure points. The proposed TurnGate model, trained on MTID, substantially outperforms baselines on harmful-intent detection, maintains low over-refusal, and generalizes across domains, attacker pipelines, and target models.

Significance. If the central claims hold after addressing annotation validity, the work would advance LLM safety by enabling precise, low-over-refusal intervention in conversational settings rather than blanket refusals. The public release of MTID and code supports reproducibility and further research on multi-turn defense.

major comments (2)

[MTID Dataset Construction] MTID construction and annotation section: The human labeling of earliest harm-enabling turns on branching attack rollouts risks bias from the construction process itself (e.g., annotators primed by explicit attack structure and matched negatives), which may not match organic multi-turn attacks. This is load-bearing for the outperformance and cross-domain generalization claims; the paper must supply evidence such as inter-annotator agreement statistics, comparison to annotations on non-constructed dialogues, or sensitivity analysis showing that TurnGate gains persist under alternative labeling.
[Results and Evaluation] Evaluation protocol (results section): The abstract and high-level claims assert substantial outperformance and generalization without reporting concrete metrics, error bars, ablation details, or the exact train/test splits and baselines used. The full results must include these to verify that measured gains are not artifacts of the MTID labeling procedure.

minor comments (2)

[Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., F1 or over-refusal rate deltas) to allow readers to assess the magnitude of improvement immediately.
[Methods] Notation for the turn-level monitor (TurnGate) should be introduced with a clear equation or diagram in the methods section to distinguish response-aware features from standard intent classifiers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below and have revised the manuscript accordingly where the concerns are valid and actionable.

read point-by-point responses

Referee: [MTID Dataset Construction] MTID construction and annotation section: The human labeling of earliest harm-enabling turns on branching attack rollouts risks bias from the construction process itself (e.g., annotators primed by explicit attack structure and matched negatives), which may not match organic multi-turn attacks. This is load-bearing for the outperformance and cross-domain generalization claims; the paper must supply evidence such as inter-annotator agreement statistics, comparison to annotations on non-constructed dialogues, or sensitivity analysis showing that TurnGate gains persist under alternative labeling.

Authors: We acknowledge that the branching construction process could prime annotators and introduce bias relative to fully organic dialogues. In the revised manuscript we now report inter-annotator agreement (Fleiss’ κ = 0.76 across three annotators) and a sensitivity analysis in which we shift the labeled harm-enabling turns by ±1 position; TurnGate’s F1 advantage over baselines remains stable (within 3 points). A direct head-to-head comparison against annotations collected on non-constructed, real-world multi-turn conversations is not feasible in the current study, as it would require an entirely new large-scale annotation campaign; we have added this as an explicit limitation and direction for future work. revision: partial
Referee: [Results and Evaluation] Evaluation protocol (results section): The abstract and high-level claims assert substantial outperformance and generalization without reporting concrete metrics, error bars, ablation details, or the exact train/test splits and baselines used. The full results must include these to verify that measured gains are not artifacts of the MTID labeling procedure.

Authors: We agree that the original results section was insufficiently detailed. The revised version now contains: (i) full numerical results with means and standard deviations over five random seeds (TurnGate 91.8 ± 1.4 F1 vs. strongest baseline 79.2 ± 2.3 F1 on the MTID test set); (ii) error bars on all reported figures; (iii) ablation tables isolating the contribution of response-awareness and turn-level features; and (iv) explicit documentation of the 80/20 train/test split (seed 42) together with the complete list of baselines and their hyper-parameters. These additions allow readers to verify that the reported gains are not artifacts of the labeling procedure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper constructs MTID via branching attack rollouts, matched benign negatives, and external human annotations of earliest harm-enabling turns, then trains the TurnGate monitor on this dataset to detect closure points. No equations, fitted parameters, or self-citations appear in the derivation chain that reduce predictions or uniqueness claims to inputs by construction. Performance claims rest on comparisons to external baselines and cross-domain generalization tests rather than internal redefinitions or self-referential fits. The central result therefore retains independent content from the dataset construction process and external annotations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard supervised learning and human annotation of harm thresholds.

pith-pipeline@v0.9.0 · 5534 in / 983 out tokens · 26386 ms · 2026-05-13T07:50:41.471580+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

process reward R_t = u_ben 1[t < t*] − c_miss 1[t = t*] … for PASS; u_hit 1[t = t*] − c_early(1−ϕ(t;t*))… for BLOCK

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

[1]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Universal jailbreak backdoors in large language model alignment

Thomas Baumann. Universal jailbreak backdoors in large language model alignment. InNeurips Safe Generative AI Workshop 2024, 2024

work page 2024
[3]

Benchmarking Misuse Mitigation Against Covert Adversaries

Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J Pappas, Eric Wong, and Hamed Hassani. Benchmarking misuse mitigation against covert adversaries.arXiv preprint arXiv:2506.06414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

When llm meets drl: Advancing jailbreaking efficiency via drl-guided search.Advances in Neural Information Processing Systems, 37:26814–26845, 2024

Xuan Chen, Yuzhou Nie, Wenbo Guo, and Xiangyu Zhang. When llm meets drl: Advancing jailbreaking efficiency via drl-guided search.Advances in Neural Information Processing Systems, 37:26814–26845, 2024

work page 2024
[5]

Ferret: Faster and effective automated red teaming with reward-based scoring technique.CoRR, abs/2408.10701, 2024

Pala Tej Deep, Vernon Toh Yan Han, Rishabh Bhardwaj, and Soujanya Poria. Ferret: Faster and effective automated red teaming with reward-based scoring technique.CoRR, abs/2408.10701, 2024

work page arXiv 2024
[6]

A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.North American Chapter of the Association for Computational Linguistics, 2023

Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily.North American Chapter of the Association for Computational Linguistics, 2023

work page 2023
[7]

Attacks, defenses and evaluations for llm conversation safety: A survey

Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. Attacks, defenses and evaluations for llm conversation safety: A survey. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6734–6747, 2024

work page 2024
[8]

Deliberative alignment: Reasoning enables safer language models.arXiv preprint arXiv:2412.16339, 2024

Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models.arXiv preprint arXiv:2412.16339, 2024

work page arXiv 2024
[9]

Mtsa: Multi-turn safety alignment for llms through multi-round red-teaming

Weiyang Guo, Jing Li, Wenya Wang, Yu Li, Daojing He, Jun Yu, and Min Zhang. Mtsa: Multi-turn safety alignment for llms through multi-round red-teaming. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26424–26442, 2025

work page 2025
[10]

Harmful prompt classification for large language models

Ojasvi Gupta, Marta de la Cuadra Lozano, Abdelsalam Busalim, Rajesh R Jaiswal, and Keith Quille. Harmful prompt classification for large language models. InProceedings of the 2024 Conference on Human Centred Artificial Intelligence - Education and Practice, HCAIep ’24, page 8–14, New York, NY , USA, 2024. Association for Computing Machinery

work page 2024
[11]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023. 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models, 2024

work page 2024
[13]

GUARD: role- playing to generate natural-language jailbreakings to test guideline adherence of large language models

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, and Haohan Wang. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv preprint arXiv:2402.03299, 2024

work page arXiv 2024
[14]

Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

work page 2024
[15]

Drattack: Prompt de- composition and reconstruction makes powerful llms jailbreakers

Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. Drattack: Prompt de- composition and reconstruction makes powerful llms jailbreakers. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 13891–13913, 2024

work page 2024
[16]

AutoDAN: Generating stealthy jailbreak prompts on aligned large language models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[17]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scien- tist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

MALicious INTent dataset and inoculating LLMs for enhanced disinformation detection

Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino, and Adam Wierzbicki. MALicious INTent dataset and inoculating LLMs for enhanced disinformation detection. In Vera Demberg, Kentaro Inui, and Lluís Mar- quez, editors,Proceedings of the 19th Conference of the European Chapter of the Associat...

work page 2026
[19]

Helping big language models protect themselves: An enhanced filtering and summariza- tion system.arXiv preprint arXiv:2505.01315,

Sheikh Samit Muhaimin and Spyridon Mastorakis. Helping large language models protect themselves: An enhanced filtering and summarization system.arXiv preprint arXiv:2505.01315, 2025

work page arXiv 2025
[20]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[21]

Under- standing and mitigating overrefusal in llms from an unveiling perspective of safety decision boundary

Licheng Pan, Yongqi Tong, Xin Zhang, Xiaolu Zhang, Jun Zhou, and Zhixuan Chu. Under- standing and mitigating overrefusal in llms from an unveiling perspective of safety decision boundary. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21068–21086, 2025

work page 2025
[22]

Automated red teaming with goat: the generative offensive agent tester.arXiv preprint arXiv:2410.01606, 2024

Maya Pavlova, Erik Brinkman, Krithika Iyer, Vitor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton Ferrer, Ivan Evtimov, and Aaron Grattafiori. Automated red teaming with goat: the generative offensive agent tester.arXiv preprint arXiv:2410.01606, 2024

work page arXiv 2024
[23]

Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, and Saadia Gabriel

Salman Rahman et al. X-Teaming: Multi-turn jailbreaks and defenses with adaptive multi-agents. arXiv preprint arXiv:2504.13203, 2025

work page arXiv 2025
[24]

Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts.arXiv preprint arXiv:2410.10700,

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Qian, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhen Ma, and Jing Shao. Derail yourself: Multi-turn LLM jailbreak attack through self- discovered clues.arXiv preprint arXiv:2410.10700, 2024

work page arXiv 2024
[25]

Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts

Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. Llms know their vulnerabilities: Uncover safety gaps through natural distribution shifts. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24763–24785, 2025. 11

work page 2025
[26]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

work page 2024
[27]

Great, now write an article about that: The crescendo {Multi-Turn}{LLM} jailbreak attack

Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo {Multi-Turn}{LLM} jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440, 2025

work page 2025
[28]

Llms in software security: A survey of vulnerability detection techniques and insights.ACM Computing Surveys, 58(5):1–35, 2025

Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, and Jeff Huang. Llms in software security: A survey of vulnerability detection techniques and insights.ACM Computing Surveys, 58(5):1–35, 2025

work page 2025
[29]

Safe in isolation, dangerous together: Agent-driven multi- turn decomposition jailbreaks on LLMs

Devansh Srivastav and Xiao Zhang. Safe in isolation, dangerous together: Agent-driven multi- turn decomposition jailbreaks on LLMs. In Ehsan Kamalloo, Nicolas Gontier, Xing Han Lu, Nouha Dziri, Shikhar Murty, and Alexandre Lacoste, editors,Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025), pages 170–183, Vienna, Austria, July

work page 2025
[30]

Association for Computational Linguistics

work page
[31]

RoleBreak: Character hallucination as a jailbreak attack in role-playing systems

Yihong Tang, Bo Wang, Xu Wang, Dongming Zhao, Jing Liu, Ruifang He, and Yuexian Hou. RoleBreak: Character hallucination as a jailbreak attack in role-playing systems. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, and Steven Schockaert, editors,Proceedings of the 31st International Conference on Computational Linguis...

work page 2025
[32]

Prompt, divide, and conquer: Bypassing large language model safety filters via segmented and distributed prompt processing

Johan Wahréus, Ahmed Hussain, and Panos Papadimitratos. Prompt, divide, and conquer: Bypassing large language model safety filters via segmented and distributed prompt processing. arXiv preprint arXiv:2503.21598, 2025

work page arXiv 2025
[33]

Do llms really forget? evaluating unlearning with knowledge correlation and confidence awareness.arXiv preprint arXiv:2506.05735, 2025

Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K Potluru, Eli Chien, Kamalika Chaudhuri, et al. Do llms really forget? evaluating unlearning with knowledge correlation and confidence awareness.arXiv preprint arXiv:2506.05735, 2025

work page arXiv 2025
[34]

The trojan knowledge: Bypassing commercial LLM guardrails via harmless prompt weaving and adaptive tree search.arXiv preprint arXiv:2512.01353, 2025

Rongzhe Wei, Peizhi Niu, Xinjie Shen, Tony Tu, Yifan Li, Ruihan Wu, Eli Chien, Pin-Yu Chen, Olgica Milenkovic, and Pan Li. The trojan knowledge: Bypassing commercial LLM guardrails via harmless prompt weaving and adaptive tree search.arXiv preprint arXiv:2512.01353, 2025

work page arXiv 2025
[35]

Redagent: Red teaming large language models with context-aware autonomous language agent.arXiv preprint arXiv:2407.16667, 2024

Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, and Kui Ren. Redagent: Red teaming large language models with context-aware autonomous language agent.arXiv preprint arXiv:2407.16667, 2024

work page arXiv 2024
[36]

Chain of attack: Hide your intention through multi-turn interrogation

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, and Songlin Hu. Chain of attack: Hide your intention through multi-turn interrogation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 9881–9901, 2025

work page 2025
[37]

Low-resource languages jailbreak gpt-4

Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. Low-resource languages jailbreak gpt-4.arXiv preprint arXiv:2310.02446, 2023

work page arXiv 2023
[38]

Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949, 2025

Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, and He He. Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949, 2025

work page arXiv 2025
[39]

Guardians and offenders: A survey on harmful content generation and safety mitigation of llm

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, and Zhuo Lu. Guardians and offenders: A survey on harmful content generation and safety mitigation of llm. arXiv preprint arXiv:2508.05775, 2025

work page arXiv 2025
[40]

DAMON: A dialogue-aware MCTS framework for jailbreaking large language models

Xu Zhang, Xunjian Yin, Dinghao Jing, Huixuan Zhang, Xinyu Hu, and Xiaojun Wan. DAMON: A dialogue-aware MCTS framework for jailbreaking large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6361– 6377, Su...

work page 2025
[41]

Intention analysis makes llms a good jailbreak defender

Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. Intention analysis makes llms a good jailbreak defender. InProceedings of the 31st International Conference on Computational Linguistics, pages 2947–2968, 2025

work page 2025
[42]

Zhehao Zhang, Weijie Xu, Fanyou Wu, and Chandan K. Reddy. Falsereject: A resource for improving contextual safety and mitigating over-refusals in LLMs via structured reasoning. In Second Conference on Language Modeling, 2025

work page 2025
[43]

Qwen3Guard Technical Report

Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276, 2025

work page internal anchor Pith review arXiv 2025
[44]

How alignment and jailbreak work: Explain llm safety through intermediate hidden states

Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, and Yongbin Li. How alignment and jailbreak work: Explain llm safety through intermediate hidden states. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 2461–2488, 2024

work page 2024
[45]

Improving alignment and robustness with circuit breakers.Advances in Neural Information Processing Systems, 37:83345– 83373, 2024

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers.Advances in Neural Information Processing Systems, 37:83345– 83373, 2024

work page 2024
[46]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 13 Appendix Contents A Illustrative Case Studies 15 A.1 Harmful Example: Cryptocurrency Phishing Site . . . . . . . . . . . . . . . . . . 15 A.2 Benign Exam...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

chemistry: Prompts related to chemical substances, reactions, synthesis, laboratory procedures, chemical compounds, drugs, explosives, toxins, or any chemistry-related topics

work page
[48]

index" (0-based),

cybersecurity: Prompts related to hacking, malware, network attacks, system vul- nerabilities, data breaches, phishing, social engineering, unauthorized access, or any cybersecurity/computer security topics. For each prompt, determine if it belongs to one of these categories. A prompt may belong to neither category if it’s unrelated to both domains. Instr...

work page