Investigating and Alleviating Harm Amplification in LLM Interactions

Alan Ritter; Ruohao Guo; Wei Xu

arxiv: 2606.02423 · v1 · pith:Z7VDRASKnew · submitted 2026-06-01 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-28 14:30 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords harm amplification · LLM safety · multi-turn interactions · TrajSafe · HarmAmp benchmark · risk mitigation · AI alignment · conversation monitoring

The pith

TrajSafe monitors multi-turn LLM conversation trajectories and intervenes to cut harm amplification while keeping normal use intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that LLMs can amplify harm when users extend conversations over multiple turns, both by giving novices access to specialized knowledge and by enabling harmful operations at a scale that manual effort cannot match. To measure this, it builds HarmAmp, a benchmark of multi-turn scenarios spanning twelve risk categories, each grounded in real-world threats and selected because it requires extended dialogue. It then introduces TrajSafe, which watches the conversation trajectory, probes unclear user intent, and steers the model toward safer completions. Experiments indicate this approach lowers measured harmfulness across the benchmark scenarios without raising refusal rates on safe queries or damaging the model's other abilities. If correct, the work points to a practical way to address safety issues that appear only after several back-and-forth exchanges.

Core claim

LLMs act as harm amplifiers in multi-turn settings by democratizing domain expertise and scaling harmful operations; HarmAmp provides a benchmark of scenarios across twelve risk categories, each meeting criteria of substantive amplification, operational specificity, and multi-turn necessity; TrajSafe, a proactive monitor, anticipates harmful trajectories and intervenes via intent probing and safer steering, significantly reducing harmfulness while preserving a low over-refusal rate and general capabilities.

What carries the argument

TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion.
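
To make the idea of trajectory-level monitoring concrete, here is a minimal Python sketch. It is not the paper's implementation: the class `TrajectoryMonitor`, the `Action` values, the thresholds, the decay factor, and the `toy_scorer` placeholder are all assumptions introduced for illustration. The only thing it tries to show is why a monitor that accumulates risk across turns can react where a per-turn filter would not.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Action(Enum):
    ALLOW = "allow"   # pass the turn to the target model unchanged
    PROBE = "probe"   # ask the user a clarifying question about intent
    STEER = "steer"   # redirect the model toward a safer completion


@dataclass
class TrajectoryMonitor:
    # Hypothetical scorer: maps one user turn to a risk value in [0, 1].
    score_turn: Callable[[str], float]
    probe_threshold: float = 0.5   # assumed value, not from the paper
    steer_threshold: float = 0.8   # assumed value, not from the paper
    decay: float = 0.7             # share of earlier risk carried forward
    risk: float = 0.0
    history: List[str] = field(default_factory=list)

    def observe(self, user_turn: str) -> Action:
        """Update the trajectory-level risk and choose an intervention."""
        self.history.append(user_turn)
        # Earlier turns keep contributing, discounted by `decay`.
        self.risk = min(1.0, self.decay * self.risk + self.score_turn(user_turn))
        if self.risk >= self.steer_threshold:
            return Action.STEER
        if self.risk >= self.probe_threshold:
            return Action.PROBE
        return Action.ALLOW


def toy_scorer(turn: str) -> float:
    """Placeholder: a real system would call a trained risk classifier."""
    return 0.4 if "[risky]" in turn else 0.0


monitor = TrajectoryMonitor(score_turn=toy_scorer)
for turn in ["benign question", "[risky] step", "[risky] step", "[risky] step"]:
    print(turn, "->", monitor.observe(turn).value)
```

In this toy setup no single turn scores above 0.4, so a filter that checks each turn against the 0.5 threshold in isolation would never fire, while the accumulated risk crosses the probe threshold on the third turn and the steer threshold on the fourth.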

If this is right

Harmfulness drops markedly in the multi-turn scenarios tested across the twelve risk categories.
Over-refusal rate on safe queries stays low.
Target model capabilities remain largely unchanged.
The approach applies across the full set of risk categories in the benchmark.
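
The first two outcomes above are rates, and it helps to pin down what each one is computed over. The sketch below is one plausible reading, not the paper's metric definitions: the `Outcome` record, its field names, and the two rate functions are hypothetical. It assumes harmfulness is measured only on adversarial scenarios and over-refusal only on safe queries.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Outcome:
    adversarial: bool      # True: harm-amplification scenario; False: safe query
    judged_harmful: bool   # verdict from a judge (human or model) on the dialogue
    refused: bool          # whether the system declined or blocked the request


def _rate(flags: Sequence[bool]) -> float:
    """Fraction of True values; an empty split is treated as an error."""
    if not flags:
        raise ValueError("no conversations in this split")
    return sum(flags) / len(flags)


def harmfulness_rate(outcomes: Sequence[Outcome]) -> float:
    """Share of adversarial conversations that were judged harmful."""
    return _rate([o.judged_harmful for o in outcomes if o.adversarial])


def over_refusal_rate(outcomes: Sequence[Outcome]) -> float:
    """Share of safe conversations that were refused."""
    return _rate([o.refused for o in outcomes if not o.adversarial])
```

Keeping the two denominators separate matters: a monitor that refuses everything would score perfectly on harmfulness and badly on over-refusal, which is the trade-off the paper claims to avoid.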

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety systems may need to shift focus from single-turn refusals to ongoing trajectory monitoring.
Benchmarks limited to single prompts could miss the main amplification risks that appear only after several turns.
Similar monitoring could be tested on other generative models beyond the ones evaluated here.

Load-bearing premise

The twelve risk categories and chosen scenarios are representative enough of real-world harm amplification to let the benchmark measure the actual risk.

What would settle it

Deploy TrajSafe in live multi-turn interactions with users attempting real harmful goals and check whether harm rates drop substantially or over-refusal on benign queries rises sharply.

Figures

Figures reproduced from arXiv: 2606.02423 by Alan Ritter, Ruohao Guo, Wei Xu.

**Figure 1.** Top: Prior work targets general, single-turn harmful requests. Bottom: We instead study multi-turn harm amplification, where LLMs compound assistance across turns to enable more specific and scalable harm.

**Figure 2.** Distribution of turn-level feedback categories

**Figure 3.** Illustration of the screening process of H…

**Figure 4.** Screening filtering prompt used to decide whether an instance can be expanded or should be rewritten.

**Figure 5.** Annotation guideline used by LLM annotators to convert single-turn harmful scenarios into multi-step…

**Figure 6.** Example of a multi-turn capability acquisition scenario where the model is asked to correct deliberately…

Original abstract

Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HarmAmp benchmark and TrajSafe monitor target multi-turn LLM risks but rest on author-defined criteria and thin experimental reporting.

The paper introduces HarmAmp, a benchmark of multi-turn scenarios across 12 risk categories, and TrajSafe, a monitor that watches conversation trajectories and intervenes by probing intent or steering outputs. This moves the safety discussion from isolated prompts to how models can compound harm over extended exchanges, which matches real deployment patterns in chat systems.

It does a reasonable job framing two concrete axes: giving novices access to specialized harmful knowledge and enabling higher-volume operations than manual effort allows. The three selection criteria (substantive amplification, operational specificity, multi-turn necessity) are explicit and try to keep the benchmark from drifting into generic toxicity tests.

The main weakness is that the abstract gives almost no information on measurement. There are no details on how harmfulness was scored, which baselines were run, whether differences reached statistical significance, or how over-refusal and capability preservation were quantified. The scenarios are described as meeting the three criteria, but the text does not report inter-annotator agreement, external expert review, or direct comparison against single-turn harm sets to show the criteria actually hold or that the scenarios are representative. Without those checks, any reported reduction from TrajSafe is difficult to interpret.

This work is aimed at researchers and engineers who build safety layers for conversational LLMs. A reader already working on trajectory-level monitoring or multi-turn red-teaming could extract useful scenario ideas, but anyone needing reproducible numbers would have to wait for the full methods and data.

The idea is timely enough that it should go to peer review rather than desk rejection; the referees can check whether the benchmark construction and evaluation details support the claims.

Referee Report

2 major / 1 minor

Summary. The paper introduces HarmAmp, a benchmark of multi-turn harm amplification scenarios across twelve risk categories, each asserted to meet criteria of substantive amplification, operational specificity, and multi-turn necessity. It proposes TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes (e.g., via intent probing or steering). Experiments claim TrajSafe significantly reduces harmfulness in multi-turn interactions while preserving low over-refusal rates and the target model's general capabilities.

Significance. If the central empirical claims hold after validation of the benchmark, the work addresses an important gap in LLM safety by focusing on multi-turn harm amplification rather than single-turn refusals. The introduction of a dedicated benchmark and a trajectory-aware intervention method could provide a useful paradigm for future safety research, particularly if the scenarios prove representative of real-world risks.

major comments (2)

[Benchmark construction] Benchmark construction section: The assertion that the scenarios across the twelve risk categories satisfy substantive amplification, operational specificity, and multi-turn necessity is presented without inter-annotator agreement scores, expert validation, or comparison against existing single-turn harm benchmarks. This is load-bearing for the central claim because TrajSafe's reported harm reduction is measured exclusively on HarmAmp; if the scenarios do not genuinely require multi-turn interaction or amplify harm beyond single-turn baselines, the measured improvement cannot be attributed to trajectory monitoring.
[Experimental evaluation] Experimental evaluation section: No details are provided on the operationalization of harmfulness (e.g., annotation protocol, LLM-as-judge prompts, or human evaluation), choice of baselines, statistical significance testing, or quantitative metrics for over-refusal rate and capability preservation. These omissions make it impossible to assess whether the reported positive outcomes are robust or reproducible.
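
For readers unfamiliar with the inter-annotator agreement scores requested in the first comment, the following sketch computes Cohen's kappa for two annotators. It is a generic illustration, not something the paper reports: the function name `cohen_kappa` and the example labels are assumptions. Kappa compares observed agreement $p_o$ with the agreement $p_e$ expected by chance, as $\kappa = (p_o - p_e) / (1 - p_e)$.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = "scenario needs multiple turns", 0 = "does not".
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance given their label frequencies, so kappa comes out at 0.5 rather than 0.75.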

minor comments (1)

[Abstract] The abstract states positive experimental outcomes but defers all measurement details to later sections; a brief summary of the harmfulness metric and evaluation protocol in the abstract would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that provide the requested details without altering the core claims.

Referee: [Benchmark construction] Benchmark construction section: The assertion that the scenarios across the twelve risk categories satisfy substantive amplification, operational specificity, and multi-turn necessity is presented without inter-annotator agreement scores, expert validation, or comparison against existing single-turn harm benchmarks. This is load-bearing for the central claim because TrajSafe's reported harm reduction is measured exclusively on HarmAmp; if the scenarios do not genuinely require multi-turn interaction or amplify harm beyond single-turn baselines, the measured improvement cannot be attributed to trajectory monitoring.

Authors: We acknowledge that the current manuscript states the criteria but does not report inter-annotator agreement, explicit expert validation steps, or direct comparisons to single-turn benchmarks. The scenarios were constructed from documented real-world threats with the three criteria applied during selection, yet we agree this process requires more transparent documentation. In revision we will add a dedicated subsection detailing the construction protocol, including any multi-author review process used and a qualitative comparison table against representative single-turn harm datasets to demonstrate multi-turn necessity. We will also note limitations where quantitative IAA was not collected.
Revision: yes
Referee: [Experimental evaluation] Experimental evaluation section: No details are provided on the operationalization of harmfulness (e.g., annotation protocol, LLM-as-judge prompts, or human evaluation), choice of baselines, statistical significance testing, or quantitative metrics for over-refusal rate and capability preservation. These omissions make it impossible to assess whether the reported positive outcomes are robust or reproducible.

Authors: We agree the experimental section is underspecified. The manuscript reports aggregate harm reduction, low over-refusal, and capability preservation but omits the precise annotation protocol, judge prompts, baseline selection rationale, statistical tests, and exact metrics. In the revised version we will expand the evaluation subsection to include: (1) full LLM-as-judge prompts and any human validation protocol, (2) justification for chosen baselines, (3) results of statistical significance tests, and (4) explicit quantitative definitions and values for over-refusal rate and capability metrics (e.g., MMLU or equivalent).
Revision: yes
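
One form the promised significance testing could take is a paired test over the same scenarios run with and without the monitor. The sketch below uses an exact McNemar test, which is a standard choice for paired binary outcomes; nothing in the text says the authors use it, and the function name `exact_mcnemar_p` and the example data are assumptions.

```python
from math import comb
from typing import Sequence


def exact_mcnemar_p(baseline: Sequence[bool], treated: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired harmful/not-harmful verdicts.

    baseline[i] and treated[i] are verdicts for the same scenario i,
    without and with the monitor respectively (True = judged harmful).
    """
    if len(baseline) != len(treated):
        raise ValueError("verdict lists must be paired and of equal length")
    # Only discordant pairs carry information about a difference.
    fixed = sum(b and not t for b, t in zip(baseline, treated))
    broken = sum(t and not b for b, t in zip(baseline, treated))
    n = fixed + broken
    if n == 0:
        return 1.0
    # Under the null, each discordant pair goes either way with probability 1/2.
    tail = sum(comb(n, k) for k in range(min(fixed, broken) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical verdicts for six scenarios.
without_monitor = [True, True, True, True, True, False]
with_monitor = [False, False, False, False, False, False]
print(exact_mcnemar_p(without_monitor, with_monitor))
```

With five scenarios improved and none worsened, the two-sided p-value is 0.0625, which illustrates why a benchmark needs enough paired scenarios before a reduction can be called significant.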

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on author-defined benchmark

The paper introduces the HarmAmp benchmark with author-specified criteria (substantive amplification, operational specificity, multi-turn necessity) and evaluates TrajSafe empirically on it. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text that would reduce any central claim to its inputs by construction. The benchmark definition is standard for new datasets and does not create a self-definitional loop where measured reductions are forced by the criteria themselves. The work is self-contained as an empirical study against its own scenarios.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper introduces two new constructs (benchmark and monitor) whose validity rests on domain assumptions about what constitutes harm amplification rather than on prior literature or external validation.

axioms (1)

Domain assumption: Harm amplification can be meaningfully operationalized through scenarios that meet substantive amplification, operational specificity, and multi-turn necessity criteria.
Invoked when constructing the HarmAmp benchmark scenarios.

invented entities (2)

HarmAmp benchmark (no independent evidence)
Purpose: To evaluate multi-turn harm amplification across twelve risk categories.
Newly defined testbed introduced in the paper.
TrajSafe monitor (no independent evidence)
Purpose: To anticipate harmful trajectories and intervene via probing or steering.
Newly proposed proactive safety component.

pith-pipeline@v0.9.1-grok · 5718 in / 1365 out tokens · 23601 ms · 2026-06-28T14:30:03.075281+00:00 · methodology


[1] [1]

2024 , volume =

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle =. 2024 , volume =

2024

[2] [2]

2025 , month =

Moix, Alex and Lebedev, Ken and Klein, Jacob , title =. 2025 , month =

2025

[3] [3]

The Twelfth International Conference on Learning Representations , year=

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. The Twelfth International Conference on Learning Representations , year=

[4] [4]

AgentHarm: A Benchmark for Measuring Harmfulness of

Maksym Andriushchenko and Alexandra Souly and Mateusz Dziemian and Derek Duenas and Maxwell Lin and Justin Wang and Dan Hendrycks and Andy Zou and J Zico Kolter and Matt Fredrikson and Yarin Gal and Xander Davies , booktitle=. AgentHarm: A Benchmark for Measuring Harmfulness of. 2025 , url=

2025

[5] [5]

The Fourteenth International Conference on Learning Representations , year=

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks , author=. The Fourteenth International Conference on Learning Representations , year=

[6] [6]

Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=

[7] [7]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants? , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025

[8] [8]

How to Protect Yourself from 5 G Radiation? Investigating LLM Responses to Implicit Misinformation

Guo, Ruohao and Xu, Wei and Ritter, Alan. How to Protect Yourself from 5 G Radiation? Investigating LLM Responses to Implicit Misinformation. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1468

work page doi:10.18653/v1/2025.emnlp-main.1468 2025

[9] [9]

The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[10] [10]

Proceedings of the 34th USENIX Conference on Security Symposium , articleno =

Russinovich, Mark and Salem, Ahmed and Eldan, Ronen , title =. Proceedings of the 34th USENIX Conference on Security Symposium , articleno =. 2025 , isbn =

2025

[11] [11]

2024 , eprint=

Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues , author=. 2024 , eprint=

2024

[12] [12]

2024 , eprint=

LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet , author=. 2024 , eprint=

2024

[13] [13]

Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models

Jiang, Yifan and Aggarwal, Kriti and Laud, Tanmay and Munir, Kashif and Pujara, Jay and Mukherjee, Subhabrata. Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1311

work page doi:10.18653/v1/2025.findings-acl.1311 2025

[14] [14]

C o S afe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Yu, Erxin and Li, Jing and Liao, Ming and Wang, Siqi and Zuchen, Gao and Mi, Fei and Hong, Lanqing. C o S afe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.968

work page doi:10.18653/v1/2024.emnlp-main.968 2024

[15] [15]

Meta-Tuning LLM s to Leverage Lexical Knowledge for Generalizable Language Style Understanding

Guo, Ruohao and Xu, Wei and Ritter, Alan. Meta-Tuning LLM s to Leverage Lexical Knowledge for Generalizable Language Style Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.740

work page doi:10.18653/v1/2024.acl-long.740 2024

[16] [16]

The Fourteenth International Conference on Learning Representations , year=

SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks , author=. The Fourteenth International Conference on Learning Representations , year=

[17]

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents. Second Conference on Language Modeling.

[18]

Improving Alignment and Robustness with Circuit Breakers. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[19]

The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430.

[20]

Lu, Xiaoya and Liu, Dongrui and Yu, Yi and Xu, Luxin and Shao, Jing. X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.282

[21]

Röttger et al. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.301

[22]

Justin Cui and Wei-Lin Chiang and Ion Stoica and Cho-Jui Hsieh. 2025.

[23]

Han, Peixuan and Qian, Cheng and Chen, Xiusi and Zhang, Yuji and Ji, Heng and Zhang, Denghui. SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.366

[24]

From hard refusals to safe-completions: Toward output-centric safety training. arXiv preprint arXiv:2508.09224.

[25]

Spear phishing with large language models. arXiv preprint arXiv:2305.06972.

[26]

Minqian Liu and Zhiyang Xu and Xinyi Zhang and Heajun An and Sarvech Qadir and Qi Zhang and Pamela J. Wisniewski and Jin-Hee Cho and Sang Won Lee and Ruoxi Jia and Lifu Huang. 2025.

[27]

Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[28]

Pauli, Amalie Brogaard and Augenstein, Isabelle and Assent, Ira. Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.naacl-long.506

[29]

Esin Durmus and Liane Lovitt and Alex Tamkin and Stuart Ritchie and Jack Clark and Deep Ganguli. 2024.

[30]

GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[31]

Qwen3Guard Technical Report. 2025.

[32]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[33]

The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[34]

OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.

[35]

Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[36]

Qwen3Guard technical report. arXiv preprint arXiv:2510.14276.

[37]

WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. Advances in Neural Information Processing Systems.

[38]

Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.

[39]

Lin, Stephanie and Hilton, Jacob and Evans, Owain. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.229

[40]

Deshpande, Kaustubh and Sirdeshmukh, Ved and Mols, Johannes Baptist and Jin, Lifeng and Hernandez-Cardona, Ed-Yeremai and Lee, Dean and Kritz, Jeremy and Primack, Willow E. and Yue, Summer and Xing, Chen. MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs. Findings of the Association for Computational... 2025. doi:10.18653/v1/2025.findings-acl.958

[41]

OpenAI. 2025.

[42]

Meta. 2025.

[43]

Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.

[44]

Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems.

[45]

Guo, Weiyang and Li, Jing and Wang, Wenya and Li, Yu and He, Daojing and Yu, Jun and Zhang, Min. MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1282

[46]

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models. The Thirteenth International Conference on Learning Representations.

[47]

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue. arXiv preprint arXiv:2605.05630.