arxiv: 2606.12429 · v1 · pith:43SVYWMFnew · submitted 2026-05-14 · 💻 cs.CY · cs.AI

Muse Spark Safety & Preparedness Report

Cristina Menghini, Peter Ney, Hamza Kwisaba, Zifan (Sail) Wang, Miles Turpin, Felix Binder, Jean-Christophe Testud, Aidan Boyd, Nathaniel Li, Ivan Evtimov, Klaudia Krawiecka, Arman Zharmagambetov, Jeremy Kritz, Alexander R. Fabbri, Daniel Song, Jinpeng Miao, Joonas Hjelt, Meghna Ramani, Leona Lan, Reza Aghajani, Joanna Bitton, Mahesh Pasupuleti, Devin Norder, Khalid El-Arini, Paridhi Singh, Vítor Albiero, Sahana CB, Rashnil Chaturvedi, Elahe Dabir, Edoardo Debenedetti, Jim Gust, Ziwen Han, Kat He, Sean Hendryx, Lifeng Jin, Polina Kirichenko, Sandra Lefdal, Kenneth Li, Asad Liaqat, Inna Lin, Despoina Magka, Neal Mangaokar, Ishita Mediratta, Zach Miller, Smitha Milli, Niloofar Mireshghallah, Saba Nazir, Hung Nguyen, Maximilian Nickel, Kelvin Niu, Kerem Oktar, Bhargavi Paranjape, Parth Pathak, Maya Pavlova, Emmanuel Ramirez, David Renardy, Candace Ross, Yasha Sheynin, Claudia Shi, Shivam Singhal, Evangelia Spiliopoulou, Rakshith Sharma Srinivasa, Jamelle Watson-Daniels, Spencer Whitman, Adina Williams, Chen Xing, Andy Zou, Tommy Ma, Siqi Deng, James Beldock, Prashant Ratanchandani, Kate Plawiak, Taesung Lee, Ryan Victory, Lindsay Hundley, Rachad Alao, Himaghna Bhattacharjee, Jianfeng Chi, Gary Frost, Pegah Ghahremani, Niki Howe, Yuheng Huang, Saeed Jahed, Hannah Korevaar, Trang Le, Zhe Liu, Jinghong Luo, Qin Lyu, Nina Mehrabi, Abraham Montilla, Chirag Nagpal, Cyrus Nikolaidis, Rajvardhan Oak, Manoj Ravi, Vidya Sarma, Aman Shankar, Alana Shine, Eric Michael Smith, Mariana Tandon, Michael Tontchev, Caoyu Wang, Zihan Wang, Corinne Wong, Zheng Wu, Hongyuan Zhan, Justin Zhao, Zexuan Zhong, Chengxu Zhuang, Tristan Goodman, Ayaz Minhas, Harrison Rudolph, Victoria Jeffries, Ingrid Dickinson, Alex Vaughan, Lauren Deason, Kamalika Chaudhuri, Julian Michael, Shengjia Zhao, Summer Yue

Pith reviewed 2026-06-30 19:46 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords AI safety evaluation · catastrophic risk · chemical biological risks · cybersecurity risks · loss of control · model deployment · mitigations · refusal benchmarks

The pith

Muse Spark deployment meets acceptable residual risk levels for chemical, biological, cybersecurity, and loss of control threats after mitigations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates Muse Spark for potential catastrophic risks in chemical and biological domains, cybersecurity, and scenarios involving loss of control. It identifies elevated risks in these areas before any safeguards, with chemical and biological capabilities reaching high risk. After applying a multi-layered set of mitigations, the residual risks are assessed as acceptable, supporting the decision to release the model. A reader would care about the specific evidence and benchmarks used to reach this conclusion for a new large language model.

Core claim

The preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment as presenting acceptable levels of residual risks under the internal scaling framework. Broad evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities likely reaching the high risk category before safeguards. A multi-layered set of mitigations was implemented, and the model shows state-of-the-art refusal on benchmarks for hazardous workflows in chemistry and biology, leading to its release.

What carries the argument

The multi-layered mitigations addressing identified dual-use and high-risk capabilities, particularly refusal mechanisms for hazardous workflows.

If this is right

  • The model is suitable for deployment in the AI product.
  • State-of-the-art performance on refusal benchmarks for chemistry and biology hazards is achieved.
  • Risks in cybersecurity and loss of control are also reduced to acceptable levels.
  • Broader content safety considerations are addressed separately from the main risk framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar mitigation strategies might be tested on future models to maintain acceptable risk levels.
  • Independent verification of the refusal benchmarks could strengthen confidence in the results.
  • The approach to risk assessment may set expectations for how other developers handle dual-use capabilities in language models.

Load-bearing premise

The implemented multi-layered mitigations sufficiently reduce all pre-mitigation elevated risks to acceptable residual levels.

What would settle it

Demonstration that Muse Spark can still provide actionable assistance on a high-risk chemical or biological task after mitigations would indicate the residual risks are higher than assessed.

Figures

Figures reproduced from arXiv: 2606.12429 by Abraham Montilla, Adina Williams, Aidan Boyd, Alana Shine, Alexander R. Fabbri, Alex Vaughan, Aman Shankar, Andy Zou, Arman Zharmagambetov, Asad Liaqat, Ayaz Minhas, Bhargavi Paranjape, Candace Ross, Caoyu Wang, Chengxu Zhuang, Chen Xing, Chirag Nagpal, Claudia Shi, Corinne Wong, Cristina Menghini, Cyrus Nikolaidis, Daniel Song, David Renardy, Despoina Magka, Devin Norder, Edoardo Debenedetti, Elahe Dabir, Emmanuel Ramirez, Eric Michael Smith, Evangelia Spiliopoulou, Felix Binder, Gary Frost, Hamza Kwisaba, Hannah Korevaar, Harrison Rudolph, Himaghna Bhattacharjee, Hongyuan Zhan, Hung Nguyen, Ingrid Dickinson, Inna Lin, Ishita Mediratta, Ivan Evtimov, Jamelle Watson-Daniels, James Beldock, Jean-Christophe Testud, Jeremy Kritz, Jianfeng Chi, Jim Gust, Jinghong Luo, Jinpeng Miao, Joanna Bitton, Joonas Hjelt, Julian Michael, Justin Zhao, Kamalika Chaudhuri, Kate Plawiak, Kat He, Kelvin Niu, Kenneth Li, Kerem Oktar, Khalid El-Arini, Klaudia Krawiecka, Lauren Deason, Leona Lan, Lifeng Jin, Lindsay Hundley, Mahesh Pasupuleti, Manoj Ravi, Mariana Tandon, Maximilian Nickel, Maya Pavlova, Meghna Ramani, Michael Tontchev, Miles Turpin, Nathaniel Li, Neal Mangaokar, Niki Howe, Niloofar Mireshghallah, Nina Mehrabi, Paridhi Singh, Parth Pathak, Pegah Ghahremani, Peter Ney, Polina Kirichenko, Prashant Ratanchandani, Qin Lyu, Rachad Alao, Rajvardhan Oak, Rakshith Sharma Srinivasa, Rashnil Chaturvedi, Reza Aghajani, Ryan Victory, Saba Nazir, Saeed Jahed, Sahana CB, Sandra Lefdal, Sean Hendryx, Shengjia Zhao, Shivam Singhal, Siqi Deng, Smitha Milli, Spencer Whitman, Summer Yue, Taesung Lee, Tommy Ma, Trang Le, Tristan Goodman, Victoria Jeffries, Vidya Sarma, Vítor Albiero, Yasha Sheynin, Yuheng Huang, Zach Miller, Zexuan Zhong, Zhe Liu, Zheng Wu, Zifan (Sail) Wang, Zihan Wang, Ziwen Han.

Figure 1
Figure 1: Indirect prompt injection in a RAG pipeline. An attacker embeds instructions in web … Figure 35: A successful prompt injection example in Search-PI Simple.
Original abstract

Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript is a safety and preparedness report on Meta's Muse Spark LLM. It describes evaluations under the company's Advanced AI Scaling Framework for Chemical and Biological, Cybersecurity, and Loss of Control risks, states that pre-mitigation evaluations identified elevated (including high) risks in some domains, claims that multi-layered mitigations reduce these to acceptable residual levels with state-of-the-art refusal on hazardous workflows, and concludes that deployment within Meta AI is justified.

Significance. If the internal evaluations and mitigation claims were independently verifiable, the report would supply a concrete industry case study of applying a scaling framework to deployment decisions and could help establish norms for documenting residual catastrophic risks. In its current form the absence of data means it functions primarily as a high-level assertion rather than a contribution that advances measurable understanding of mitigation effectiveness.

major comments (3)
  1. [Abstract] The central claim that Muse Spark's deployment within Meta AI presents 'acceptable levels of residual risks' after mitigations is unsupported by any benchmark scores, pre/post-mitigation comparisons, test protocols, error bars, or raw results, so the sufficiency of the mitigations cannot be evaluated.
  2. [Abstract] The risk category assignments ('high risk' pre-mitigation for Chemical and Biological capabilities) and the determination that mitigations achieve 'acceptable' residual levels rest entirely on internal Meta evaluations and the company's own Advanced AI Scaling Framework without any external validation, independent data, or explicit statement of the Framework's numerical thresholds and acceptance criteria.
  3. [Abstract] The assertion of 'state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology' is made without naming the benchmarks, reporting scores, or providing comparisons to prior models, leaving the performance claim uncheckable.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their review. We address the three major comments below. As an industry safety report based on internal evaluations, the manuscript is intentionally high-level; detailed quantitative data cannot be released publicly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that Muse Spark's deployment within Meta AI presents 'acceptable levels of residual risks' after mitigations is unsupported by any benchmark scores, pre/post-mitigation comparisons, test protocols, error bars, or raw results, so the sufficiency of the mitigations cannot be evaluated.

    Authors: We acknowledge the report provides only summarized conclusions rather than raw scores or protocols. These evaluations were conducted under Meta's internal Advanced AI Scaling Framework; releasing specific numbers, pre/post comparisons, or test details would expose sensitive capability information. The manuscript communicates the risk assessment outcomes that informed the deployment decision at the level appropriate for a public preparedness report. No revision to add this data is planned. revision: no

  2. Referee: [Abstract] The risk category assignments ('high risk' pre-mitigation for Chemical and Biological capabilities) and the determination that mitigations achieve 'acceptable' residual levels rest entirely on internal Meta evaluations and the company's own Advanced AI Scaling Framework without any external validation, independent data, or explicit statement of the Framework's numerical thresholds and acceptance criteria.

    Authors: The risk categories and acceptance criteria are defined internally within Meta's Advanced AI Scaling Framework. Numerical thresholds are proprietary and not disclosed to protect our risk management methodology. This document reports the application of the framework and resulting decision rather than enabling external replication or validation. External validation is not feasible for evaluations involving sensitive dual-use capabilities. revision: no

  3. Referee: [Abstract] The assertion of 'state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology' is made without naming the benchmarks, reporting scores, or providing comparisons to prior models, leaving the performance claim uncheckable.

    Authors: Specific benchmark names and scores are internal to avoid providing actionable details on hazardous workflows. The state-of-the-art claim reflects comparative internal testing against prior models. The manuscript states the outcome of these evaluations at a summary level consistent with the report's purpose. revision: no

Standing simulated objections not resolved
  • The authors decline to provide specific benchmark names, raw scores, pre/post-mitigation data, or explicit numerical thresholds from the internal Advanced AI Scaling Framework, citing proprietary and misuse-prevention considerations.

Circularity Check

1 step flagged

Acceptable residual risk conclusion is defined and asserted via internal Meta framework and evaluations

specific steps
  1. self-definitional [Abstract]
    "Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. ... We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI."

    The framework that defines 'acceptable levels' and 'high risk' categories is internal to Meta; the evaluations identifying elevated risks, applying mitigations, and declaring the outcome acceptable are performed by the same organization. The release decision is therefore equivalent to the authors' own input assertion that their mitigations suffice under criteria they control.

full rationale

The paper's central claim—that Muse Spark presents acceptable residual risks and can be released—rests on the authors' own Advanced AI Scaling Framework and their internal pre/post-mitigation assessments. No external benchmarks, independent verification, or falsifiable criteria outside the company's definitions are referenced. This reduces the load-bearing conclusion to a self-referential assertion by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new entities are described in the abstract; the report rests on Meta's proprietary Advanced AI Scaling Framework and undisclosed internal evaluation procedures.

pith-pipeline@v0.9.1-grok · 6247 in / 1108 out tokens · 30156 ms · 2026-06-30T19:46:45.933581+00:00 · methodology

