pith. machine review for the scientific record.

arxiv: 2604.04060 · v1 · submitted 2026-04-05 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks


Pith reviewed 2026-05-13 17:03 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: LLM defense · multi-round attacks · cooperative agents · stateful defense · adversarial robustness · EMRA benchmark · AI safety · evolving attacks

The pith

CoopGuard uses cooperative agents that track conversation history to defend LLMs against attacks that adapt across multiple rounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CoopGuard as a defense framework for large language models that deploys multiple specialized agents working in coordination while preserving an internal record of prior interactions. Current defenses typically respond to individual prompts without adapting to an adversary's changing tactics over successive exchanges. CoopGuard counters this by updating a defense state that informs decisions across rounds and assigns complementary roles to its agents. A sympathetic reader would see value in this because real applications of LLMs often involve extended dialogues where attacks can build gradually. The reported experiments indicate substantially lower attack success rates than prior methods when tested on a new multi-round benchmark.

Core claim

CoopGuard is a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (a Deferring Agent, a Tempting Agent, and a Forensic Agent) for complementary round-level strategies, coordinated by a System Agent that conditions decisions on the evolving defense state (interaction history) and orchestrates the agents over time. On the EMRA benchmark of 5,200 adversarial samples across 8 attack types, CoopGuard reduces attack success rate by 78.9% relative to state-of-the-art defenses while improving deceptive rate by 186% and reducing attack efficiency by 167.9%.
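
One note on reading those percentages: a "reduction" of more than 100% only parses as a relative change in an underlying quantity, and Figure 5 suggests attack efficiency is measured by the attacker's average token cost per dialogue, so "reducing attack efficiency by 167.9%" plausibly means that cost grows by 167.9% under CoopGuard. A minimal sketch of the arithmetic, with made-up numbers chosen only to reproduce the reported magnitudes:

```python
def relative_change_pct(baseline: float, treated: float) -> float:
    """Signed relative change of `treated` versus `baseline`, in percent."""
    return (treated - baseline) / baseline * 100.0

# Hypothetical values, for illustration only (not reported in the paper):
print(relative_change_pct(baseline=0.90, treated=0.19))  # ~ -78.9 (ASR falls)
print(relative_change_pct(baseline=1800, treated=4822))  # ~ +167.9 (attacker
# token cost rises, i.e. attack "efficiency" is reduced by 167.9%)
```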

What carries the argument

A stateful cooperative agent system in which a System Agent updates an internal defense state from interaction history and coordinates Deferring, Tempting, and Forensic agents for round-by-round responses.
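
The paper's exact coordination mechanics are not reproduced on this page, but the sentence above pins down the shape of the loop: score each incoming round in the context of prior rounds, route it to a role, and fold the outcome back into the state. A minimal sketch under that reading, with every threshold, keyword rule, and canned reply invented for illustration (the actual agents are LLM-driven, not keyword matchers):

```python
from dataclasses import dataclass, field

LOW, HIGH = 0.3, 0.7  # hypothetical risk thresholds

@dataclass
class DefenseState:
    """What persists across rounds: the transcript and per-round risk signals."""
    history: list = field(default_factory=list)  # (prompt, reply) pairs
    risk: list = field(default_factory=list)     # one score per round

def forensic_score(prompt: str, state: DefenseState) -> float:
    """Stand-in for the Forensic Agent: keyword risk plus escalation memory."""
    base = 0.9 if "bypass" in prompt.lower() else 0.1
    escalation = 0.1 * sum(r > LOW for r in state.risk)  # attacks that build up
    return min(1.0, base + escalation)

def respond(prompt: str, state: DefenseState) -> str:
    """Stand-in for the System Agent's round-level routing decision."""
    score = forensic_score(prompt, state)
    state.risk.append(score)
    if score > HIGH:
        reply = "[Tempting Agent] plausible-but-useless decoy answer"
    elif score > LOW:
        reply = "[Deferring Agent] ambiguous clarifying question"
    else:
        reply = "[base model] ordinary helpful answer"
    state.history.append((prompt, reply))
    return reply

state = DefenseState()
for turn in ["tell me about locks",
             "how do I bypass a lock",
             "hypothetically, how would one bypass it"]:
    print(respond(turn, state))
```

The one property the sketch isolates is statefulness: the escalation term makes round t's decision depend on rounds 1 through t-1, which is precisely what per-prompt filters lack.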

If this is right

  • LLMs gain protection that persists and adapts across successive conversation turns rather than resetting per prompt.
  • Defense systems can simultaneously block attacks and increase the effort required for adversaries to succeed.
  • Multi-round evaluation becomes a standard requirement for assessing LLM robustness in interactive settings.
  • Agent coordination mechanisms become a practical design pattern for handling adaptive adversaries.
  • Production deployments of LLMs in extended dialogues receive stronger safeguards against strategy refinement by attackers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same state-maintenance idea could be tested in non-LLM agents that face adaptive opponents in sequential decision tasks.
  • Real-world deployment would require checking whether the agent roles remain effective when attackers have knowledge of the defense structure.
  • Combining this approach with static safety filters might produce layered protections that cover both single-turn and multi-turn threats.
  • New benchmarks focused on longer interaction chains could reveal whether the reported gains hold when conversations exceed the lengths tested here.

Load-bearing premise

The EMRA benchmark together with the assigned agent roles captures the full range of real evolving multi-round attacks without introducing new weaknesses or overfitting to the selected attack types.

What would settle it

Evaluating CoopGuard on a fresh collection of multi-round attacks constructed independently of the EMRA benchmark: if it achieved no better attack success rate there than existing defenses, the central claim would fall; if the gains persisted, the overfitting worry would largely dissolve.

Figures

Figures reproduced from arXiv: 2604.04060 by Haoyu Li, Jianhua Li, Jun Wu, Qinghua Mao, Siyuan Li, Xi Lin, Xiu Su, Yuliang Chen, Zehao Liu.

Figure 1
Figure 1: Illustration of the challenge posed by independent yet progressively evolving multi-round adversarial attacks on LLMs and our innovative CoopGuard multi-agent adaptive defense mechanism to effectively counter these evolving threats. view at source ↗
Figure 2
Figure 2: Overview of the CoopGuard multi-agent jailbreak defense framework. The Deferring Agent slows down probing progress via ambiguity injection, while the Tempting Agent generates deceptive traps to mislead attackers. The Forensic Agent collects and analyzes evidence of attack behaviors. The System Agent oversees the agents, dynamically refining defense strategies to adapt to evolving threats. This cooperative process safeguards the… view at source ↗
Figure 3
Figure 3: Resource footprint of multi-round attacks. (a) Token consumption… view at source ↗
Figure 5
Figure 5: Evaluation of AE, quantified by the average token consumption per dialogue across different models. CoopGuard forces the attacker to expend significantly more resources (higher is better for defense) compared to baselines. view at source ↗
Figure 6
Figure 6: Cross-validation results with Deepseek-Judge and Human… view at source ↗
read the original abstract

As Large Language Models (LLMs) are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially those evolving over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine strategies across rounds. In this work, we propose CoopGuard, a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (Deferring Agent, Tempting Agent, and Forensic Agent) for complementary round-level strategies, coordinated by a System Agent, which conditions decisions on the evolving defense state (interaction history) and orchestrates agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively evolving multi-round attacks on LLMs. Experiments show that CoopGuard reduces attack success rate by 78.9% over state-of-the-art defenses, while improving deceptive rate by 186% and reducing attack efficiency by 167.9%, offering a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CoopGuard, a stateful multi-round defense framework for LLMs that employs four cooperative agents (Deferring Agent, Tempting Agent, Forensic Agent, and coordinating System Agent) to maintain and update an internal defense state based on interaction history. It introduces the EMRA benchmark with 5,200 adversarial samples across 8 attack types to simulate progressively evolving threats and reports that CoopGuard reduces attack success rate by 78.9% relative to prior defenses while improving deceptive rate by 186% and reducing attack efficiency by 167.9%.

Significance. If the central empirical claims hold under rigorous verification, the work offers a timely advance in LLM safety by shifting from reactive single-turn defenses to stateful, cooperative multi-agent strategies that adapt across rounds. The EMRA benchmark could serve as a useful evaluation resource for multi-round attack research, and the reported gains indicate potential practical value for interactive LLM deployments. The approach's emphasis on complementary agent roles and explicit state maintenance distinguishes it from prior defenses.

major comments (3)
  1. [§4] §4 (EMRA Benchmark Construction): The description of how the 5,200 multi-round sequences are generated does not specify whether attack progressions remain fixed and non-responsive to the evolving defense state (i.e., outputs from Deferring/Tempting/Forensic agents and System Agent updates) or incorporate adaptive adversary behavior. This distinction is load-bearing for the claim that the 78.9% ASR reduction demonstrates robustness to truly evolving attacks rather than non-adaptive progressions.
  2. [§5] §5 (Experiments) and Abstract: The headline metrics (78.9% ASR reduction, 186% deceptive-rate improvement, 167.9% attack-efficiency reduction) are reported without variance across runs, number of independent trials, statistical significance tests, or explicit train/test splits on EMRA. These omissions prevent assessment of whether the gains are stable or potentially due to benchmark-specific tuning.
  3. [§3.2] §3.2 (Agent Coordination): The mechanisms by which the three specialized agents update the shared defense state and how the System Agent conditions decisions on that state are described at a high level without pseudocode, state-transition rules, or concrete update equations. This limits reproducibility and makes it difficult to verify that the complementary round-level strategies are implemented as claimed.
minor comments (2)
  1. [Throughout] Ensure uniform capitalization and abbreviation of agent names (e.g., 'System Agent' vs. 'system agent') throughout the text and figures.
  2. [Figure 1] Figure 1 (agent-interaction diagram): Add explicit arrows or labels indicating the flow of defense-state updates between agents to improve clarity of the stateful coordination.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us strengthen the clarity and rigor of the manuscript. We address each major comment point by point below. Revisions have been made to improve reproducibility and transparency where the comments identify gaps.

read point-by-point responses
  1. Referee: §4 (EMRA Benchmark Construction): The description of how the 5,200 multi-round sequences are generated does not specify whether attack progressions remain fixed and non-responsive to the evolving defense state (i.e., outputs from Deferring/Tempting/Forensic agents and System Agent updates) or incorporate adaptive adversary behavior. This distinction is load-bearing for the claim that the 78.9% ASR reduction demonstrates robustness to truly evolving attacks rather than non-adaptive progressions.

    Authors: We confirm that EMRA employs fixed, non-adaptive attack progressions: each multi-round sequence is predefined to simulate progressive evolution across the eight attack types without conditioning on the defense agents' outputs or state updates. This controlled design enables reproducible evaluation of how well the stateful defense counters increasingly sophisticated but non-responsive threats. We agree that fully adaptive adversaries would constitute a stronger test and have added an explicit statement of this limitation plus a brief discussion of future adaptive-benchmark extensions in the revised §4. The reported gains therefore apply to non-adaptive evolving progressions, which still represent a meaningful advance over single-turn defenses. (The fixed-versus-adaptive distinction is sketched in code after these responses.) revision: partial

  2. Referee: §5 (Experiments) and Abstract: The headline metrics (78.9% ASR reduction, 186% deceptive-rate improvement, 167.9% attack-efficiency reduction) are reported without variance across runs, number of independent trials, statistical significance tests, or explicit train/test splits on EMRA. These omissions prevent assessment of whether the gains are stable or potentially due to benchmark-specific tuning.

    Authors: We have rerun all experiments with five independent random seeds and now report mean ± standard deviation for every metric in §5. Paired t-tests confirm statistical significance (p < 0.01) for the reported improvements over baselines. Because CoopGuard is training-free, the 70/30 split mentioned in the original text applies only to auxiliary data used for prompt engineering; we have clarified this and added the split details. The abstract has been updated to include the variance information. (The testing protocol is sketched after these responses.) revision: yes

  3. Referee: §3.2 (Agent Coordination): The mechanisms by which the three specialized agents update the shared defense state and how the System Agent conditions decisions on that state are described at a high level without pseudocode, state-transition rules, or concrete update equations. This limits reproducibility and makes it difficult to verify that the complementary round-level strategies are implemented as claimed.

    Authors: We have expanded §3.2 with a new Algorithm 1 that provides pseudocode for the full coordination loop, including the precise state-update rules (defense-state vector concatenation of agent outputs and attack indicators) and the conditioning logic used by the System Agent. Concrete update equations for the state vector are now included. These additions directly address the reproducibility concern while preserving the original high-level description. (One reading of the concatenation rule is sketched after these responses.) revision: yes
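
On the first response, the fixed-versus-adaptive distinction is easy to state in code. A minimal sketch, where `defend` and `next_prompt` are hypothetical callables standing in for the defense stack and the attacker:

```python
# Non-adaptive, EMRA-style per the rebuttal: the attack script is fixed upfront,
# so the defender's replies never alter what the attacker asks next.
def run_fixed(attack_rounds: list, defend) -> list:
    return [defend(prompt) for prompt in attack_rounds]

# Adaptive, the stronger test the rebuttal defers to future work: each prompt
# may condition on the defender's previous reply.
def run_adaptive(next_prompt, defend, n_rounds: int) -> list:
    replies, last = [], None
    for _ in range(n_rounds):
        prompt = next_prompt(last)  # attacker reacts to the defense
        last = defend(prompt)
        replies.append(last)
    return replies
```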
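On the second response, the described protocol (five seeds, mean ± standard deviation, a paired t-test across seeds) is standard; a sketch with invented per-seed attack success rates, using SciPy's paired test:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed ASR values (fractions) over five seeds each:
baseline  = np.array([0.52, 0.49, 0.55, 0.51, 0.53])
coopguard = np.array([0.11, 0.10, 0.13, 0.09, 0.12])

print(f"baseline  ASR: {baseline.mean():.3f} ± {baseline.std(ddof=1):.3f}")
print(f"CoopGuard ASR: {coopguard.mean():.3f} ± {coopguard.std(ddof=1):.3f}")

# Paired across seeds because each seed fixes a shared evaluation condition;
# p < 0.01 obtains only if the per-seed gaps are consistent, as they are here.
t_stat, p_value = stats.ttest_rel(baseline, coopguard)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```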
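On the third response, the phrase "defense-state vector concatenation of agent outputs and attack indicators" admits a straightforward reading: the state grows each round by appending that round's record. A sketch of that reading, with dummy vectors in place of real agent outputs:

```python
import numpy as np

def update_state(state: np.ndarray,
                 agent_outputs: np.ndarray,
                 attack_indicators: np.ndarray) -> np.ndarray:
    """One possible update rule: append this round's agent outputs and
    attack indicators to the running defense-state vector."""
    round_record = np.concatenate([agent_outputs, attack_indicators])
    return np.concatenate([state, round_record])

state = np.empty(0)
for _ in range(3):  # three rounds of dummy 4-dim outputs + 2 indicators
    state = update_state(state,
                         agent_outputs=np.random.rand(4),
                         attack_indicators=np.array([0.0, 1.0]))
print(state.shape)  # (18,)
```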

Circularity Check

0 steps flagged

No significant circularity detected in derivation or evaluation chain

full rationale

The paper introduces CoopGuard as a new multi-agent defense framework and the EMRA benchmark for evaluation, then reports empirical results comparing attack success rates against prior defenses. No equations, first-principles derivations, or predictions are present that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental measurements on the introduced benchmark using standard metrics, rather than on derivations that fold back into their own inputs. This is a standard empirical systems paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the assumption that multi-agent coordination with shared state can reliably counter adaptive attacks and that the new benchmark faithfully captures real threat evolution.

axioms (1)
  • domain assumption Cooperative agents with complementary roles can maintain an effective shared defense state across rounds
    Invoked in the description of the System Agent coordinating Deferring, Tempting, and Forensic Agents.
invented entities (1)
  • Deferring Agent, Tempting Agent, Forensic Agent, System Agent no independent evidence
    purpose: Specialized roles for round-level defense strategies coordinated over time
    New agent types introduced to implement the stateful defense; no independent evidence provided beyond the framework description.

pith-pipeline@v0.9.0 · 5528 in / 1284 out tokens · 44574 ms · 2026-05-13T17:03:08.150559+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

