The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Pith reviewed 2026-05-15 19:51 UTC · model grok-4.3
The pith
The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is twofold: the publicly released WMDP benchmark provides an open proxy for measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security, and the RMU unlearning technique, which operates by controlling model representations, selectively reduces model performance on these questions while leaving general capabilities in areas such as biology and computer science largely intact.
What carries the argument
RMU, a representation-control unlearning method that targets and suppresses specific hazardous knowledge in model activations.
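To make the mechanism concrete, the sketch below renders RMU's two-term objective in PyTorch: a forget loss that steers the model's activations on hazardous text toward a fixed random control vector, and a retain loss that pins activations on benign text to a frozen copy of the model. It assumes HuggingFace-style models that expose `output_hidden_states`; the function name and coefficient values are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def rmu_loss(updated_model, frozen_model, forget_batch, retain_batch,
             layer_idx, control_vec, steer_coeff=6.5, alpha=1200.0):
    """Illustrative RMU objective; coefficients are placeholders.

    Forget term: push the updated model's layer-`layer_idx` activations
    on hazardous text toward a scaled random vector, scrambling the
    representations that encode the targeted knowledge.
    Retain term: keep activations on benign text close to the frozen
    reference model, preserving general capabilities.
    """
    def hidden(model, batch, track_grad):
        ctx = torch.enable_grad() if track_grad else torch.no_grad()
        with ctx:
            out = model(**batch, output_hidden_states=True)
        return out.hidden_states[layer_idx]

    h_forget = hidden(updated_model, forget_batch, track_grad=True)
    h_retain = hidden(updated_model, retain_batch, track_grad=True)
    h_retain_ref = hidden(frozen_model, retain_batch, track_grad=False)

    target = steer_coeff * control_vec  # random unit vector, drawn once
    forget_loss = F.mse_loss(h_forget, target.expand_as(h_forget))
    retain_loss = F.mse_loss(h_retain, h_retain_ref)
    return forget_loss + alpha * retain_loss
```

In the paper, only a small set of weights near the steered layer is updated against this loss, which is what keeps the edit localized rather than degrading the whole network.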
Load-bearing premise
Performance on WMDP questions reliably indicates real-world ability to assist in developing weapons of mass destruction.
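Concretely, "performance" here is standard multiple-choice accuracy: the model is credited when it assigns the highest likelihood to the correct option. A minimal scoring sketch, assuming a HuggingFace causal LM; the prompt template is a placeholder and the paper's exact format may differ.

```python
import torch

def mcq_predict(model, tok, question, choices):
    """Return the index of the option the model scores as most likely.

    Each option is scored by the summed log-probability of its text as a
    continuation of the question prompt. Assumes the prompt prefix
    tokenizes identically with and without the choice appended (a common
    simplification).
    """
    prompt = f"{question}\nAnswer: "
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    scores = []
    for choice in choices:
        ids = tok(prompt + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Row i of the shifted log-probs predicts input token i + 1.
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        answer_ids = ids[0, prompt_len:]
        answer_lp = logprobs[prompt_len - 1:].gather(1, answer_ids.unsqueeze(1))
        scores.append(answer_lp.sum().item())
    return max(range(len(choices)), key=scores.__getitem__)
```

Accuracy over the 3,668 items is then the fraction of questions whose predicted index matches the gold answer; the premise is that this number tracks real-world capability.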
What would settle it
A model that scores near random on WMDP but still supplies accurate step-by-step guidance for producing a biological weapon would show the benchmark fails to capture actual hazardous capability.
Original abstract
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
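For readers who want to run the numbers themselves, the dataset is public; a minimal loading sketch, assuming the Hugging Face hub path `cais/wmdp` and these field names (both are assumptions here; wmdp.ai is the canonical source named in the abstract).

```python
from datasets import load_dataset

# Hub path, config name, and field names are assumptions; wmdp.ai is
# the canonical distribution point named in the paper.
wmdp_bio = load_dataset("cais/wmdp", "wmdp-bio", split="test")
ex = wmdp_bio[0]
print(ex["question"])
for letter, choice in zip("ABCD", ex["choices"]):
    print(f"  {letter}. {choice}")
print("gold:", "ABCD"[ex["answer"]])
```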
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the publicly released WMDP benchmark, a dataset of 3,668 multiple-choice questions spanning biosecurity, cybersecurity, and chemical security, developed as a proxy for hazardous knowledge in LLMs. It proposes RMU, a representation-control unlearning method, and reports that RMU lowers WMDP accuracy while preserving general capabilities in biology and computer science, with the benchmark and code released at https://wmdp.ai.
Significance. If the empirical claims hold, the work is significant because it supplies the first open benchmark for hazardous capabilities, directly addressing the limitation that existing evaluations are private. The public release of both the dataset and RMU code enables reproducible research on unlearning, and the suggestion that targeted representation control can reduce proxy risk without broad capability loss provides a concrete, testable direction for AI safety.
major comments (2)
- [§3] Benchmark Construction and Filtering: The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that accuracy on the retained questions still tracks misuse-relevant capability.
- [§4] RMU Experiments: The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.
minor comments (1)
- Add explicit section numbers and equation references throughout for easier navigation; several cross-references in the methods appear to rely on implicit numbering.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, indicating where revisions will be incorporated.
Point-by-point responses
Referee: [§3] Benchmark Construction and Filtering: The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that accuracy on the retained questions still tracks misuse-relevant capability.
Authors: We agree that WMDP functions as a proxy and that direct validation linking MCQ accuracy to open-ended synthesis capabilities would strengthen the work. However, performing such validations (e.g., expert red-teaming on retained items) would require testing actual hazardous knowledge application, which raises insurmountable ethical and safety barriers. The questions were developed and filtered in consultation with domain experts precisely to retain proxy relevance while eliminating actionable details. The public release at wmdp.ai is intended to enable safe, community-driven follow-up studies. We will add an expanded limitations subsection discussing the proxy assumption and outlining directions for future validation. revision: partial
Referee: [§4] RMU Experiments: The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.
Authors: The full manuscript (Section 4 and associated tables) already contains the requested details: accuracy tables showing WMDP drops (with standard deviations across 3–5 seeds), direct comparisons to gradient ascent and fine-tuning baselines, MMLU biology/CS subset scores demonstrating capability preservation, and statistical tests (paired t-tests) confirming significance. These results are referenced in the results narrative. We will revise the abstract and introduction summary to explicitly cite the relevant tables and effect sizes so the quantitative support is immediately visible. revision: yes
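For readers checking the rebuttal's claim, the described significance test is straightforward to reproduce; a minimal sketch with made-up placeholder accuracies (the real per-seed numbers live in the paper's Section 4 tables).

```python
import numpy as np
from scipy import stats

# Placeholder per-seed accuracies, NOT the paper's results: one
# (base, unlearned) pair per random seed on the same question set.
base_acc = np.array([0.64, 0.63, 0.65, 0.64, 0.63])
rmu_acc = np.array([0.30, 0.29, 0.31, 0.30, 0.28])

# Paired t-test across seeds, as described in the rebuttal.
t_stat, p_value = stats.ttest_rel(base_acc, rmu_acc)
print(f"mean drop = {(base_acc - rmu_acc).mean():.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.1e}")
```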
Circularity Check
No significant circularity: empirical benchmark release and experimental results
full rationale
The paper releases a public MCQ benchmark (WMDP) and evaluates an unlearning method (RMU) via direct performance measurements on held-out test sets and general capability benchmarks. No mathematical derivation chain exists; results are reported from experiments rather than fitted parameters or self-referential definitions. Central claims rest on observed accuracy drops on WMDP items and maintained scores on MMLU-style biology/CS questions, which are independently verifiable and not reduced to the inputs by construction. Self-citations, if present, are not load-bearing for the empirical findings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: WMDP questions serve as a valid proxy for hazardous knowledge without leaking sensitive information after stringent filtering.
Forward citations
Cited by 19 Pith papers
- Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models. Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
- Inducing Artificial Uncertainty in Language Models. Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
- Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models. CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
- Jailbroken Frontier Models Retain Their Capabilities. Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
- Is your algorithm unlearning or untraining? Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
- Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter. Targeting minor components in LLM representations during unlearning yields substantially better resistance to relearning attacks than prior methods.
- Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies. A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
- CAP: Controllable Alignment Prompting for Unlearning in LLMs. CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.
- CAP: Controllable Alignment Prompting for Unlearning in LLMs. CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.
- Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks. Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
- WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework. WIN-U delivers a retain-free unlearning update that approximates the gold-standard retrained model via a Woodbury-informed Newton step using only forget-set curvature information.
- Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs. LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
- Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts. Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
- Efficient machine unlearning with minimax optimality. ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.
- Chain-of-Authorization: Embedding authorization into large language models. LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
- SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond. SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
- Do Linear Probes Generalize Better in Persona Coordinates? Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
- Humanity's Last Exam. Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
- Risk Reporting for Developers' Internal AI Model Use. A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
discussion (0)