Muse Spark Safety & Preparedness Report
Pith reviewed 2026-06-30 19:46 UTC · model grok-4.3
The pith
Muse Spark deployment meets acceptable residual risk levels for chemical, biological, cybersecurity, and loss of control threats after mitigations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment as presenting acceptable levels of residual risks under the internal scaling framework. Broad evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities likely reaching the high risk category before safeguards. A multi-layered set of mitigations was implemented, and the model shows state-of-the-art refusal on benchmarks for hazardous workflows in chemistry and biology, leading to its release.
What carries the argument
The multi-layered mitigations addressing identified dual-use and high-risk capabilities, particularly refusal mechanisms for hazardous workflows.
If this is right
- The model is suitable for deployment in the AI product.
- State-of-the-art performance on refusal benchmarks for chemistry and biology hazards is achieved.
- Risks in cybersecurity and loss of control are also reduced to acceptable levels.
- Broader content safety considerations are addressed separately from the main risk framework.
Where Pith is reading between the lines
- Similar mitigation strategies might be tested on future models to maintain acceptable risk levels.
- Independent verification of the refusal benchmarks could strengthen confidence in the results.
- The approach to risk assessment may set expectations for how other developers handle dual-use capabilities in language models.
Load-bearing premise
The implemented multi-layered mitigations sufficiently reduce all pre-mitigation elevated risks to acceptable residual levels.
What would settle it
Demonstration that Muse Spark can still provide actionable assistance on a high-risk chemical or biological task after mitigations would indicate the residual risks are higher than assessed.
Original abstract
Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a safety and preparedness report on Meta's Muse Spark LLM. It describes evaluations under the company's Advanced AI Scaling Framework for Chemical and Biological, Cybersecurity, and Loss of Control risks, states that pre-mitigation evaluations identified elevated (including high) risks in some domains, claims that multi-layered mitigations reduce these to acceptable residual levels with state-of-the-art refusal on hazardous workflows, and concludes that deployment within Meta AI is justified.
Significance. If the internal evaluations and mitigation claims were independently verifiable, the report would supply a concrete industry case study of applying a scaling framework to deployment decisions and could help establish norms for documenting residual catastrophic risks. In its current form the absence of data means it functions primarily as a high-level assertion rather than a contribution that advances measurable understanding of mitigation effectiveness.
Major comments (3)
- [Abstract] The central claim that Muse Spark's deployment within Meta AI presents 'acceptable levels of residual risks' after mitigations is unsupported by any benchmark scores, pre/post-mitigation comparisons, test protocols, error bars, or raw results, so the sufficiency of the mitigations cannot be evaluated.
- [Abstract] The risk category assignments ('high risk' pre-mitigation for Chemical and Biological capabilities) and the determination that mitigations achieve 'acceptable' residual levels rest entirely on internal Meta evaluations and the company's own Advanced AI Scaling Framework, without any external validation, independent data, or explicit statement of the Framework's numerical thresholds and acceptance criteria.
- [Abstract] The assertion of 'state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology' is made without naming the benchmarks, reporting scores, or providing comparisons to prior models, leaving the performance claim uncheckable.
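To make the first and third comments concrete, here is a minimal sketch of the kind of pre/post-mitigation reporting the referee asks for: a refusal rate with an error bar. It uses the standard Wilson score interval for a binomial proportion. The counts, the function names `wilson_interval` and `refusal_report`, and the 400-prompt benchmark size are all hypothetical placeholders for illustration; none of them come from the report, which discloses no such numbers.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def refusal_report(label: str, refused: int, total: int) -> str:
    """Format a refusal rate together with its 95% Wilson interval."""
    low, high = wilson_interval(refused, total)
    rate = refused / total
    return f"{label}: {refused}/{total} refused = {rate:.1%} (95% CI {low:.1%} to {high:.1%})"


# Hypothetical placeholder counts, NOT results from the report.
print(refusal_report("pre-mitigation", 120, 400))
print(refusal_report("post-mitigation", 392, 400))
```

Reporting in this form would let a reader judge whether a claimed improvement exceeds sampling noise without revealing the hazardous prompts themselves.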
Simulated Author's Rebuttal
We thank the referee for their review. We address the three major comments below. As an industry safety report based on internal evaluations, the manuscript is intentionally high-level; detailed quantitative data cannot be released publicly.
Point-by-point responses
- Referee: [Abstract] The central claim that Muse Spark's deployment within Meta AI presents 'acceptable levels of residual risks' after mitigations is unsupported by any benchmark scores, pre/post-mitigation comparisons, test protocols, error bars, or raw results, so the sufficiency of the mitigations cannot be evaluated.
  Authors: We acknowledge the report provides only summarized conclusions rather than raw scores or protocols. These evaluations were conducted under Meta's internal Advanced AI Scaling Framework; releasing specific numbers, pre/post comparisons, or test details would expose sensitive capability information. The manuscript communicates the risk assessment outcomes that informed the deployment decision at the level appropriate for a public preparedness report. No revision to add this data is planned.
  Revision: no
- Referee: [Abstract] The risk category assignments ('high risk' pre-mitigation for Chemical and Biological capabilities) and the determination that mitigations achieve 'acceptable' residual levels rest entirely on internal Meta evaluations and the company's own Advanced AI Scaling Framework, without any external validation, independent data, or explicit statement of the Framework's numerical thresholds and acceptance criteria.
  Authors: The risk categories and acceptance criteria are defined internally within Meta's Advanced AI Scaling Framework. Numerical thresholds are proprietary and not disclosed to protect our risk management methodology. This document reports the application of the framework and resulting decision rather than enabling external replication or validation. External validation is not feasible for evaluations involving sensitive dual-use capabilities.
  Revision: no
- Referee: [Abstract] The assertion of 'state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology' is made without naming the benchmarks, reporting scores, or providing comparisons to prior models, leaving the performance claim uncheckable.
  Authors: Specific benchmark names and scores are internal to avoid providing actionable details on hazardous workflows. The state-of-the-art claim reflects comparative internal testing against prior models. The manuscript states the outcome of these evaluations at a summary level consistent with the report's purpose.
  Revision: no
- Not provided: specific benchmark names, raw scores, pre/post-mitigation data, and explicit numerical thresholds from the internal Advanced AI Scaling Framework, withheld for proprietary and misuse-prevention reasons.
Circularity Check
Acceptable residual risk conclusion is defined and asserted via internal Meta framework and evaluations
Specific steps
- Self-definitional [Abstract]: "Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. ... We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI."
  The framework that defines 'acceptable levels' and 'high risk' categories is internal to Meta; the evaluations identifying elevated risks, applying mitigations, and declaring the outcome acceptable are performed by the same organization. The release decision is therefore equivalent to the authors' own input assertion that their mitigations suffice under criteria they control.
Full rationale
The paper's central claim—that Muse Spark presents acceptable residual risks and can be released—rests on the authors' own Advanced AI Scaling Framework and their internal pre/post-mitigation assessments. No external benchmarks, independent verification, or falsifiable criteria outside the company's definitions are referenced. This reduces the load-bearing conclusion to a self-referential assertion by construction.