pith. machine review for the scientific record.

arxiv: 2604.18976 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming

Chaeyun Kim, Junghwan Kim, Kihyun Kim, MinJae Jung, Minwoo Kim, YongTaek Lim


Pith reviewed 2026-05-10 03:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords automated red teaming · LLM jailbreaking · multiplex network · multi-agent system · attack success rate · semantic communities · black-box optimization · strategy sampling

The pith

STAR-Teaming recasts high-dimensional LLM strategy search into a multiplex network of semantic communities to raise jailbreak success while lowering cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STAR-Teaming, a black-box automated red teaming framework that pairs a multi-agent system with a Strategy-Response Multiplex Network. The network converts the space of possible attack strategies into organized semantic communities drawn from embeddings of prior responses. This organization lets the system sample promising strategies more efficiently and avoids repeating similar failed attempts. A sympathetic reader would care because it supplies both higher success at eliciting restricted outputs and clearer maps of where an LLM's defenses are weakest.

Core claim

The central claim is that the Strategy-Response Multiplex Network, when used to drive optimization inside a multi-agent red teaming loop, converts the intractable high-dimensional embedding space into a tractable collection of semantic communities; these communities both improve search efficiency by eliminating redundant exploration and increase interpretability by revealing genuine clusters of strategic vulnerabilities in the target LLM.

What carries the argument

The Strategy-Response Multiplex Network, which maps strategies and responses to layered semantic communities that guide sampling and expose distinct vulnerability patterns.
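The machinery above can be sketched in miniature. The reference graph below cites the Louvain and Leiden community-detection algorithms; as a self-contained stand-in, this sketch thresholds pairwise cosine similarity between strategy embeddings and takes connected components as communities. The threshold `tau` and the toy 2-d embeddings are illustrative assumptions, not values from the paper.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den

def communities(embeddings, tau=0.9):
    """Group embeddings into semantic communities.

    Minimal stand-in: connect pairs with cosine similarity >= tau and
    take connected components (union-find). The paper's references
    point to Louvain/Leiden-style modularity detection instead."""
    parent = list(range(len(embeddings)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= tau:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy 2-d "embeddings" forming two tight clusters.
embs = [(1.0, 0.0), (0.98, 0.1), (0.0, 1.0), (0.05, 0.99)]
print(sorted(map(sorted, communities(embs))))  # → [[0, 1], [2, 3]]
```

Swapping the threshold graph for a k-nearest-neighbor graph plus Leiden clustering would bring this closer to what the cited network literature describes.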

If this is right

  • STAR-Teaming records higher attack success rates than prior automated red teaming baselines.
  • It reaches those rates with measurably lower total computation.
  • The resulting communities supply human-readable groupings of successful jailbreak tactics.
  • The same structure can be reused across multiple target LLMs without retraining the network from scratch.
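How community structure might eliminate redundant exploration in practice: the sampler has to balance exploiting communities with high observed attack success against probing rarely tried ones. The abstract does not specify the sampling objective, so the sketch below uses a generic UCB-style rule over hypothetical community names as one plausible instantiation.

```python
import math

def pick_community(stats, t, c=1.0):
    """Choose which semantic community to sample a strategy from.

    stats maps community name -> (successes, trials). UCB-style score:
    observed success rate plus an exploration bonus that shrinks as a
    community accumulates trials. Illustrative only; not the paper's
    actual objective."""
    best, best_score = None, float("-inf")
    for name, (successes, trials) in stats.items():
        if trials == 0:
            return name  # always try an untouched community first
        score = successes / trials + c * math.sqrt(math.log(t) / trials)
        if score > best_score:
            best, best_score = name, score
    return best

# Hypothetical community names and counts, not taken from the paper.
stats = {"role-play": (6, 10), "obfuscation": (1, 10), "persuasion": (0, 0)}
print(pick_community(stats, t=21))  # prints "persuasion": untried first
```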

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenders could inspect the same communities to prioritize hardening against entire families of related prompts instead of individual examples.
  • The approach might extend to other black-box search tasks such as automated prompt optimization for capability elicitation.
  • If communities prove stable across model families, they could serve as a diagnostic tool for comparing safety alignments between different LLMs.

Load-bearing premise

That the multiplex network's communities accurately reflect real strategic differences in LLM behavior rather than artifacts created by the embedding method or the network construction process itself.
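One concrete way to probe this premise is to recompute the communities under a second embedding model and measure how well the two partitions agree; high agreement would favor real strategic structure over embedding artifacts. The sketch below implements the adjusted Rand index in plain Python on toy labelings; this check is an editorial suggestion, not an experiment reported in the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two community assignments of the
    same items: 1.0 for identical partitions (up to relabeling),
    around 0.0 for chance-level agreement."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ab = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ab - expected) / (max_index - expected)

# Same grouping under two hypothetical embedding models, labels swapped.
print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```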

What would settle it

An ablation experiment in which the same multi-agent sampling runs without the multiplex network structure produces attack success rates equal to or higher than the full STAR-Teaming pipeline on the same target models and evaluation sets.

Figures

Figures reproduced from arXiv: 2604.18976 by Chaeyun Kim, Junghwan Kim, Kihyun Kim, MinJae Jung, Minwoo Kim, YongTaek Lim.

Figure 1
Figure 1. Overview of STAR-Teaming. STAR-Teaming samples and presents attack strategies; these are passed to the attacker LLM, which generates harmful prompts accordingly. view at source ↗
Figure 2
Figure 2. Overview of the STAR-Teaming architecture, consisting of (A) an Automated Red-Teaming Multi-Agent… view at source ↗
Figure 3
Figure 3. Performance comparison as a function of the number of iterations. view at source ↗
Figure 4
Figure 4. Distribution of selected strategies by retrieval… view at source ↗
Figure 5
Figure 5. The correlation between average score using… view at source ↗
Figure 6
Figure 6. Illustration of Response Network with (UP)… view at source ↗
Figure 7
Figure 7. Illustration of Strategy Network with (UP)… view at source ↗
Figure 8
Figure 8. Number of communities (top), average degree (middle), and clustering coefficient (bottom) as a function of… view at source ↗
Figure 9
Figure 9. Illustration of (A) initial mapping matrix and (B) updated mapping matrix. view at source ↗
Figure 10
Figure 10. Cross-model strategy profile. view at source ↗
Figure 11
Figure 11. Illustration of attack pipeline. view at source ↗
Figure 12
Figure 12. Illustration of attack pipeline. view at source ↗
Figure 13
Figure 13. Illustration of attack pipeline. view at source ↗
Figure 14
Figure 14. Illustration of attack pipeline. view at source ↗
Figure 15
Figure 15. Illustration of attack pipeline. view at source ↗
read the original abstract

While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM's strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar-ai/STAR-Teaming-paper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces STAR-Teaming, a black-box automated red-teaming framework for LLMs that combines a multi-agent system with a Strategy-Response Multiplex Network. The network organizes high-dimensional strategy and response embeddings into semantic communities via community detection and network-driven optimization, with the goals of improving search efficiency, preventing redundant exploration, and enhancing interpretability of LLM vulnerabilities. The central empirical claim is that this yields higher attack success rate (ASR) at lower computational cost than prior methods, supported by extensive experiments on effectiveness and explainability; code is released.

Significance. If the results hold and the multiplex communities correspond to genuine strategic vulnerabilities rather than embedding or clustering artifacts, the work offers a structured, interpretable alternative to unstructured prompt search in red teaming. The network-based recasting of the search space is a novel application of multiplex networks to LLM safety and could improve both efficiency and mechanistic understanding. Public code release supports reproducibility.

major comments (3)
  1. [Abstract] Abstract: The claim that STAR-Teaming 'significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost' is stated without any numerical results, baseline comparisons, statistical tests, or definition of how ASR is measured (e.g., success criteria, number of queries, or human/AI judgment protocol). This absence prevents assessment of the central claim.
  2. [Experiments] Experiments section: To substantiate that the Strategy-Response Multiplex Network improves performance by revealing real vulnerabilities rather than introducing artifacts, ablation studies are required that vary the embedding model, edge-weighting scheme, and community-detection algorithm while measuring impact on ASR and cost. Without these, gains could be attributable to the specific network-construction choices listed in the free-parameter ledger below.
  3. [Methodology] Methodology: The description of how the multiplex network 'recasts the intractable high-dimensional embedding space into a tractable structure' must specify the exact community-detection algorithm, the optimization objective used for strategy sampling, and any validation that detected communities align with LLM response patterns rather than geometric properties of the chosen embeddings.
minor comments (1)
  1. [Abstract] Abstract: Adding one sentence with concrete ASR deltas, query budgets, and the strongest baseline would make the empirical contribution immediately evaluable.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and indicating where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that STAR-Teaming 'significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost' is stated without any numerical results, baseline comparisons, statistical tests, or definition of how ASR is measured (e.g., success criteria, number of queries, or human/AI judgment protocol). This absence prevents assessment of the central claim.

    Authors: The abstract is written as a concise summary of the work. The full manuscript contains the requested details in the Experiments section, including tables with specific ASR values and improvements over baselines, query counts, statistical significance tests, and the ASR definition (proportion of prompts eliciting harmful outputs per the target LLM's safety policy, assessed via automated judgment with human validation on samples). We will revise the abstract to include key numerical highlights and a brief definition of ASR to make the central claim more self-contained. revision: yes

  2. Referee: [Experiments] Experiments section: To substantiate that the Strategy-Response Multiplex Network improves performance by revealing real vulnerabilities rather than introducing artifacts, ablation studies are required that vary the embedding model, edge-weighting scheme, and community-detection algorithm while measuring impact on ASR and cost. Without these, gains could be attributable to the specific network-construction choices listed in the free-parameter ledger below.

    Authors: The manuscript already includes ablation studies on the multiplex network's contribution and several design choices in the Experiments section. We agree that additional ablations systematically varying the embedding model, edge-weighting scheme, and community-detection algorithm would provide stronger evidence against artifacts. We will perform and report these experiments in the revised version, quantifying effects on ASR and computational cost. revision: yes

  3. Referee: [Methodology] Methodology: The description of how the multiplex network 'recasts the intractable high-dimensional embedding space into a tractable structure' must specify the exact community-detection algorithm, the optimization objective used for strategy sampling, and any validation that detected communities align with LLM response patterns rather than geometric properties of the chosen embeddings.

    Authors: The Methodology section describes the multiplex network construction, community detection, and network-driven optimization at a high level. We will expand this section to name the specific community-detection algorithm, state the optimization objective for strategy sampling (a utility function balancing expected attack success against redundancy across communities), and add validation results (e.g., semantic alignment metrics and case studies) showing that communities reflect LLM vulnerability patterns rather than embedding geometry alone. revision: yes
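The rebuttal names the sampling objective only verbally: a utility balancing expected attack success against redundancy across communities. A minimal sketch of one such form follows; the saturating redundancy penalty and the weight `lam` are assumptions for illustration, not the authors' stated objective.

```python
def strategy_utility(success_rate, prior_trials, lam=0.5):
    """Utility of sampling from a community: expected attack success
    minus a redundancy penalty that saturates as the community is
    revisited. Assumed form for illustration only."""
    redundancy = prior_trials / (1 + prior_trials)  # in [0, 1)
    return success_rate - lam * redundancy

# A fresh community keeps its full expected success; a heavily
# revisited one is discounted.
print(strategy_utility(0.8, 0), round(strategy_utility(0.8, 3), 3))  # 0.8 0.425
```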

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external validation, not self-referential derivation

full rationale

The paper presents STAR-Teaming as an empirical black-box framework combining multi-agent systems with a Strategy-Response Multiplex Network for organizing embeddings into communities and optimizing strategy search. Central claims of higher ASR and lower cost are supported by comparative experiments against baselines, not by any derivation that reduces to fitted parameters, self-citation, or an ansatz that is true by construction. The network is described as a recasting tool for interpretability and efficiency, with effectiveness validated through reported results rather than tautological redefinition. No load-bearing steps match the enumerated circularity patterns; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework introduces a new multiplex network construct whose effectiveness is asserted rather than derived from first principles; optimization parameters and community detection thresholds are likely tuned on data.

free parameters (1)
  • network optimization and community parameters
    Parameters controlling strategy sampling and community formation in the multiplex network are required to make the search tractable and are presumably fitted or chosen to achieve reported performance.
axioms (1)
  • domain assumption: The multiplex network structure organizes attack strategies into semantic communities that prevent redundant exploration and enhance interpretability.
    Invoked to justify the tractability and explainability advantages over raw high-dimensional search.
invented entities (1)
  • Strategy-Response Multiplex Network (no independent evidence)
    purpose: To recast intractable high-dimensional embedding space into a tractable community structure for strategy search.
    New construct introduced by the paper; no independent evidence outside the framework itself is provided in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1153 out tokens · 42781 ms · 2026-05-10T03:02:24.999360+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 22 canonical work pages · 12 internal anchors

  1. [6]

    Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems.

  2. [8]

    How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  3. [9]

    "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security.

  4. [10]

    Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems.

  5. [12]

    Many-shot jailbreaking. Advances in Neural Information Processing Systems.

  6. [13]

    Against the Achilles' Heel: A survey on red teaming for generative models. Journal of Artificial Intelligence Research.

  7. [18]

    Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems.

  8. [19]

    Large language models encode clinical knowledge. Nature, 2023.

  9. [20]

    ChatDoctor: A medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus, 2023.

  10. [21]

    Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems.

  11. [22]

    The inverse problem in classical statistical mechanics. Communications in Mathematical Physics, 1984.

  12. [23]

    Inverse statistical problems: from the inverse Ising problem to data science. Advances in Physics, 2017.

  13. [24]

    Multilayer networks. Journal of Complex Networks, 2014.

  14. [25]

    From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 2019.

  15. [28]

    Community detection in multiplex networks. ACM Computing Surveys (CSUR), 2021.

  16. [30]

    Gemma: Open models based on Gemini. 2024.

  17. [31]

    LLaMA 3 technical report. 2024.

  18. [32]

    The Claude 3 model family: Opus, Sonnet, Haiku.

  19. [33]

    Gemini: A family of highly capable multimodal models. 2025.

  20. [34]

    The Llama 3 herd of models. 2024.

  21. [35]

    Gemma 3 technical report. 2025.

  22. [36]

    Qwen3 technical report. 2025.

  23. [38]

    BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. 2023.

  24. [39]

    WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. 2024.

  25. [41]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  26. [42]

    Meta AI. 2024. Llama 3 technical report. https://llama.meta.com/llama3. Accessed: 2025-05-18

  27. [43]

    Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, and 1 others. 2024a. Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37:129696--129742

  28. [44]

    Rohan Anil, Orpaz Goldstein, Yi Tay, Slav Petrov, Wenhan Xiong, Hyung Won Chung, Zhen Qin, Mostafa Dehghani, Aakanksha Chowdhery, Daphne Ippolito, Xuezhi Wang, Jiahui Yu, Jinsung Yoon, Hanxiao Liu, Alex Ku, Barret Zoph, William Fedus, Markus Freitag, Sebastian Gehrmann, and 8 others. 2024b. https://arxiv.org/abs/2402.17764 Gemma: Open models based on ge...

  29. [45]

    Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. https://www.anthropic. com/claude-3-model-card

  30. [46]

    Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. https://doi.org/10.1088/1742-5468/2008/10/p10008 Fast unfolding of communities in large networks . Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008

  31. [47]

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419

  32. [48]

    Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672

  33. [49]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  34. [50]

    Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. 2024. Cold-attack: Jailbreaking llms with stealthiness and controllability. arXiv preprint arXiv:2402.08679

  35. [51]

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. https://arxiv.org/abs/2406.18495 Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms . Preprint, arXiv:2406.18495

  36. [52]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  37. [53]

    E. T. Jaynes. 1957. https://doi.org/10.1103/PhysRev.106.620 Information theory and statistical mechanics . Phys. Rev., 106:620--630

  38. [54]

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. https://arxiv.org/abs/2307.04657 Beavertails: Towards improved safety alignment of llm via a human-preference dataset . Preprint, arXiv:2307.04657

  39. [55]

    Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, and Haohan Wang. 2024. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. arXiv preprint arXiv:2402.03299

  40. [56]

    Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P Gleeson, Yamir Moreno, and Mason A Porter. 2014. Multilayer networks. Journal of complex networks, 2(3):203--271

  41. [57]

    Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. 2023. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6)

  42. [58]

    Zeyi Liao and Huan Sun. 2024. Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms. arXiv preprint arXiv:2404.07921

  43. [59]

    Lizhi Lin, Honglin Mu, Zenan Zhai, Minghan Wang, Yuxia Wang, Renxi Wang, Junjie Gao, Yixuan Zhang, Wanxiang Che, Timothy Baldwin, and 1 others. 2025. Against the achilles' heel: A survey on red teaming for generative models. Journal of Artificial Intelligence Research, 82:687--775

  44. [60]

    Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. 2024. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. arXiv preprint arXiv:2410.05295

  45. [61]

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451

  46. [62]

    Matteo Magnani, Obaida Hanteer, Roberto Interdonato, Luca Rossi, and Andrea Tagarelli. 2021. Community detection in multiplex networks. ACM Computing Surveys (CSUR), 54(3):1--35

  47. [63]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, and 1 others. 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249

  48. [64]

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37:61065--61105

  49. [65]

    H Chau Nguyen, Riccardo Zecchina, and Johannes Berg. 2017. Inverse statistical problems: from the inverse ising problem to data science. Advances in Physics, 66(3):197--261

  50. [66]

    Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, and 1 others. 2024. Rainbow teaming: Open-ended generation of diverse adversarial prompts. Advances in Neural Information Processing Systems, 37:69747--69786

  51. [67]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539--68551

  52. [68]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671--1685

  53. [69]

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, and 1 others. 2023. Large language models encode clinical knowledge. Nature, 620(7972):172--180

  54. [70]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and 1 others. 2024. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260

  55. [71]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, and 1 others. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  56. [72]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

  57. [73]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  58. [74]

    Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. 2019. From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9(1):1--12

  59. [75]

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079--80110

  60. [76]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . Preprint, arXiv:2505.09388

  61. [77]

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322--14350

  62. [78]

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043