Investigating and Alleviating Harm Amplification in LLM Interactions
Pith reviewed 2026-06-28 14:30 UTC · model grok-4.3
The pith
TrajSafe monitors multi-turn LLM paths and intervenes to cut harm amplification while keeping normal use intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs act as harm amplifiers in multi-turn settings by democratizing domain expertise and scaling operations; HarmAmp provides a benchmark of twelve risk categories meeting criteria of substantive amplification, operational specificity, and multi-turn necessity; TrajSafe, a proactive monitor, anticipates harmful trajectories and intervenes via intent probing and safer steering, producing large reductions in harmfulness while preserving low over-refusal and general capabilities.
What carries the argument
TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion.
If this is right
- Harmfulness drops markedly in the twelve multi-turn scenarios tested.
- Over-refusal rate on safe queries stays low.
- Target model capabilities remain largely unchanged.
- The approach applies across the full set of risk categories in the benchmark.
Where Pith is reading between the lines
- Safety systems may need to shift focus from single-turn refusals to ongoing trajectory monitoring.
- Benchmarks limited to single prompts could miss the main amplification risks that appear only after several turns.
- Similar monitoring could be tested on other generative models beyond the ones evaluated here.
Load-bearing premise
The twelve risk categories and chosen scenarios are representative enough of real-world harm amplification to let the benchmark measure the actual risk.
What would settle it
Deploy TrajSafe in live multi-turn interactions with users attempting real harmful goals and check whether harm rates drop substantially or over-refusal on benign queries rises sharply.
Figures
read the original abstract
Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HarmAmp, a benchmark of multi-turn harm amplification scenarios across twelve risk categories, each asserted to meet criteria of substantive amplification, operational specificity, and multi-turn necessity. It proposes TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes (e.g., via intent probing or steering). Experiments claim TrajSafe significantly reduces harmfulness in multi-turn interactions while preserving low over-refusal rates and the target model's general capabilities.
Significance. If the central empirical claims hold after validation of the benchmark, the work addresses an important gap in LLM safety by focusing on multi-turn harm amplification rather than single-turn refusals. The introduction of a dedicated benchmark and a trajectory-aware intervention method could provide a useful paradigm for future safety research, particularly if the scenarios prove representative of real-world risks.
major comments (2)
- [Benchmark construction] Benchmark construction section: The assertion that the twelve scenarios satisfy substantive amplification, operational specificity, and multi-turn necessity is presented without inter-annotator agreement scores, expert validation, or comparison against existing single-turn harm benchmarks. This is load-bearing for the central claim because TrajSafe's reported harm reduction is measured exclusively on HarmAmp; if the scenarios do not genuinely require multi-turn interaction or amplify harm beyond single-turn baselines, the measured improvement cannot be attributed to trajectory monitoring.
- [Experimental evaluation] Experimental evaluation section: No details are provided on the operationalization of harmfulness (e.g., annotation protocol, LLM-as-judge prompts, or human evaluation), choice of baselines, statistical significance testing, or quantitative metrics for over-refusal rate and capability preservation. These omissions make it impossible to assess whether the reported positive outcomes are robust or reproducible.
minor comments (1)
- [Abstract] The abstract states positive experimental outcomes but defers all measurement details to later sections; a brief summary of the harmfulness metric and evaluation protocol in the abstract would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that provide the requested details without altering the core claims.
read point-by-point responses
-
Referee: [Benchmark construction] Benchmark construction section: The assertion that the twelve scenarios satisfy substantive amplification, operational specificity, and multi-turn necessity is presented without inter-annotator agreement scores, expert validation, or comparison against existing single-turn harm benchmarks. This is load-bearing for the central claim because TrajSafe's reported harm reduction is measured exclusively on HarmAmp; if the scenarios do not genuinely require multi-turn interaction or amplify harm beyond single-turn baselines, the measured improvement cannot be attributed to trajectory monitoring.
Authors: We acknowledge that the current manuscript states the criteria but does not report inter-annotator agreement, explicit expert validation steps, or direct comparisons to single-turn benchmarks. The scenarios were constructed from documented real-world threats with the three criteria applied during selection, yet we agree this process requires more transparent documentation. In revision we will add a dedicated subsection detailing the construction protocol, including any multi-author review process used and a qualitative comparison table against representative single-turn harm datasets to demonstrate multi-turn necessity. We will also note limitations where quantitative IAA was not collected. revision: yes
-
Referee: [Experimental evaluation] Experimental evaluation section: No details are provided on the operationalization of harmfulness (e.g., annotation protocol, LLM-as-judge prompts, or human evaluation), choice of baselines, statistical significance testing, or quantitative metrics for over-refusal rate and capability preservation. These omissions make it impossible to assess whether the reported positive outcomes are robust or reproducible.
Authors: We agree the experimental section is underspecified. The manuscript reports aggregate harm reduction, low over-refusal, and capability preservation but omits the precise annotation protocol, judge prompts, baseline selection rationale, statistical tests, and exact metrics. In the revised version we will expand the evaluation subsection to include: (1) full LLM-as-judge prompts and any human validation protocol, (2) justification for chosen baselines, (3) results of statistical significance tests, and (4) explicit quantitative definitions and values for over-refusal rate and capability metrics (e.g., MMLU or equivalent). revision: yes
Circularity Check
No significant circularity; empirical evaluation on author-defined benchmark
full rationale
The paper introduces the HarmAmp benchmark with author-specified criteria (substantive amplification, operational specificity, multi-turn necessity) and evaluates TrajSafe empirically on it. No equations, derivations, fitted parameters, or self-citation chains appear in the provided text that would reduce any central claim to its inputs by construction. The benchmark definition is standard for new datasets and does not create a self-definitional loop where measured reductions are forced by the criteria themselves. The work is self-contained as an empirical study against its own scenarios.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Harm amplification can be meaningfully operationalized through scenarios that meet substantive amplification, operational specificity, and multi-turn necessity criteria.
invented entities (2)
-
HarmAmp benchmark
no independent evidence
-
TrajSafe monitor
no independent evidence
Reference graph
Works this paper leans on
-
[1]
2024 , volume =
Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan , booktitle =. 2024 , volume =
2024
-
[2]
2025 , month =
Moix, Alex and Lebedev, Ken and Klein, Jacob , title =. 2025 , month =
2025
-
[3]
The Twelfth International Conference on Learning Representations , year=
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , author=. The Twelfth International Conference on Learning Representations , year=
-
[4]
AgentHarm: A Benchmark for Measuring Harmfulness of
Maksym Andriushchenko and Alexandra Souly and Mateusz Dziemian and Derek Duenas and Maxwell Lin and Justin Wang and Dan Hendrycks and Andy Zou and J Zico Kolter and Matt Fredrikson and Yarin Gal and Xander Davies , booktitle=. AgentHarm: A Benchmark for Measuring Harmfulness of. 2025 , url=
2025
-
[5]
The Fourteenth International Conference on Learning Representations , year=
Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks , author=. The Fourteenth International Conference on Learning Representations , year=
-
[6]
Proceedings of the Twentieth European Conference on Computer Systems , pages=
Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=
-
[7]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants? , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[8]
How to Protect Yourself from 5 G Radiation? Investigating LLM Responses to Implicit Misinformation
Guo, Ruohao and Xu, Wei and Ritter, Alan. How to Protect Yourself from 5 G Radiation? Investigating LLM Responses to Implicit Misinformation. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1468
-
[9]
The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models , author=. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
-
[10]
Proceedings of the 34th USENIX Conference on Security Symposium , articleno =
Russinovich, Mark and Salem, Ahmed and Eldan, Ronen , title =. Proceedings of the 34th USENIX Conference on Security Symposium , articleno =. 2025 , isbn =
2025
-
[11]
2024 , eprint=
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues , author=. 2024 , eprint=
2024
-
[12]
2024 , eprint=
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet , author=. 2024 , eprint=
2024
-
[13]
Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models
Jiang, Yifan and Aggarwal, Kriti and Laud, Tanmay and Munir, Kashif and Pujara, Jay and Mukherjee, Subhabrata. Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1311
-
[14]
C o S afe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
Yu, Erxin and Li, Jing and Liao, Ming and Wang, Siqi and Zuchen, Gao and Mi, Fei and Hong, Lanqing. C o S afe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.968
-
[15]
Meta-Tuning LLM s to Leverage Lexical Knowledge for Generalizable Language Style Understanding
Guo, Ruohao and Xu, Wei and Ritter, Alan. Meta-Tuning LLM s to Leverage Lexical Knowledge for Generalizable Language Style Understanding. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.740
-
[16]
The Fourteenth International Conference on Learning Representations , year=
SafeDialBench: A Fine-Grained Safety Evaluation Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks , author=. The Fourteenth International Conference on Learning Representations , year=
-
[17]
Second Conference on Language Modeling , year=
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents , author=. Second Conference on Language Modeling , year=
-
[18]
The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
Improving Alignment and Robustness with Circuit Breakers , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
-
[19]
arXiv preprint arXiv:2509.07430 , year=
The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward , author=. arXiv preprint arXiv:2509.07430 , year=
-
[20]
Lu, Xiaoya and Liu, Dongrui and Yu, Yi and Xu, Luxin and Shao, Jing. X -Boundary: Establishing Exact Safety Boundary to Shield LLM s from Jailbreak Attacks without Compromising Usability. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.282
-
[21]
R. XST est: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.301
-
[22]
2025 , url=
Justin Cui and Wei-Lin Chiang and Ion Stoica and Cho-Jui Hsieh , booktitle=. 2025 , url=
2025
-
[23]
S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals
Han, Peixuan and Qian, Cheng and Chen, Xiusi and Zhang, Yuji and Ji, Heng and Zhang, Denghui. S afe S witch: Steering Unsafe LLM Behavior via Internal Activation Signals. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.366
-
[24]
arXiv preprint arXiv:2508.09224 , year=
From hard refusals to safe-completions: Toward output-centric safety training , author=. arXiv preprint arXiv:2508.09224 , year=
-
[25]
arXiv preprint arXiv:2305.06972 , year=
Spear phishing with large language models , author=. arXiv preprint arXiv:2305.06972 , year=
-
[26]
Wisniewski and Jin-Hee Cho and Sang Won Lee and Ruoxi Jia and Lifu Huang , booktitle=
Minqian Liu and Zhiyang Xu and Xinyi Zhang and Heajun An and Sarvech Qadir and Qi Zhang and Pamela J. Wisniewski and Jin-Hee Cho and Sang Won Lee and Ruoxi Jia and Lifu Huang , booktitle=. 2025 , url=
2025
-
[27]
arXiv preprint arXiv:2107.03374 , year=
Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=
-
[28]
Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language
Pauli, Amalie Brogaard and Augenstein, Isabelle and Assent, Ira. Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/...
-
[29]
2024 , url =
Esin Durmus and Liane Lovitt and Alex Tamkin and Stuart Ritchie and Jack Clark and Deep Ganguli , title =. 2024 , url =
2024
-
[30]
arXiv preprint arXiv:2303.08774 , year=
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
-
[31]
2025 , url=
Qwen3Guard Technical Report , author=. 2025 , url=
2025
-
[32]
arXiv preprint arXiv:2402.03300 , year=
Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=
-
[33]
arXiv preprint arXiv:2407.21783 , year=
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
-
[34]
arXiv preprint arXiv:2601.03267 , year=
Openai gpt-5 system card , author=. arXiv preprint arXiv:2601.03267 , year=
-
[35]
arXiv preprint arXiv:2505.09388 , year=
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
-
[36]
arXiv preprint arXiv:2510.14276 , year=
Qwen3guard technical report , author=. arXiv preprint arXiv:2510.14276 , year=
-
[37]
Advances in neural information processing systems , volume=
Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms , author=. Advances in neural information processing systems , volume=
-
[38]
International Conference on Learning Representations , year=
Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations , year=
-
[39]
TruthfulQA: Measuring how models mimic human false- hoods
Lin, Stephanie and Hilton, Jacob and Evans, Owain. T ruthful QA : Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.229
-
[40]
and Yue, Summer and Xing, Chen
Deshpande, Kaustubh and Sirdeshmukh, Ved and Mols, Johannes Baptist and Jin, Lifeng and Hernandez-Cardona, Ed-Yeremai and Lee, Dean and Kritz, Jeremy and Primack, Willow E. and Yue, Summer and Xing, Chen. M ulti C hallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLM s. Findings of the Association for Computational...
-
[41]
2025 , howpublished =
OpenAI , title =. 2025 , howpublished =
2025
-
[42]
2025 , howpublished =
Meta , title =. 2025 , howpublished =
2025
-
[43]
Advances in neural information processing systems , volume=
Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
-
[44]
Advances in neural information processing systems , volume=
Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=
-
[45]
MTSA : Multi-turn Safety Alignment for LLM s through Multi-round Red-teaming
Guo, Weiyang and Li, Jing and Wang, Wenya and Li, Yu and He, Daojing and Yu, Jun and Zhang, Min. MTSA : Multi-turn Safety Alignment for LLM s through Multi-round Red-teaming. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1282
-
[46]
The Thirteenth International Conference on Learning Representations , year=
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models , author=. The Thirteenth International Conference on Learning Representations , year=
-
[47]
arXiv preprint arXiv:2605.05630 , year=
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue , author=. arXiv preprint arXiv:2605.05630 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.