The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Pith reviewed 2026-05-15 19:51 UTC · model grok-4.3
The pith
The WMDP benchmark publicly measures hazardous knowledge in LLMs, and the RMU unlearning method reduces performance on it while preserving general capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is twofold: the publicly released WMDP benchmark provides an open proxy for measuring hazardous knowledge in biosecurity, cybersecurity, and chemical security, and the RMU unlearning technique, which operates by controlling model representations, selectively reduces model performance on these questions while leaving general capabilities in areas such as biology and computer science largely intact.
What carries the argument
RMU, a representation-control unlearning method that targets and suppresses specific hazardous knowledge in model activations.
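To make the mechanism concrete, the sketch below renders RMU's two-term objective in PyTorch: a forget loss that steers the model's activations on hazardous text toward a fixed random control vector, and a retain loss that pins activations on benign text to a frozen copy of the model. It assumes HuggingFace-style models that expose `output_hidden_states`; the function name and coefficient values are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def rmu_loss(updated_model, frozen_model, forget_batch, retain_batch,
             layer_idx, control_vec, steer_coeff=6.5, alpha=1200.0):
    """Illustrative RMU objective; coefficients are placeholders.

    Forget term: push the updated model's layer-`layer_idx` activations
    on hazardous text toward a scaled random vector, scrambling the
    representations that encode the targeted knowledge.
    Retain term: keep activations on benign text close to the frozen
    reference model, preserving general capabilities.
    """
    def hidden(model, batch, track_grad):
        ctx = torch.enable_grad() if track_grad else torch.no_grad()
        with ctx:
            out = model(**batch, output_hidden_states=True)
        return out.hidden_states[layer_idx]

    h_forget = hidden(updated_model, forget_batch, track_grad=True)
    h_retain = hidden(updated_model, retain_batch, track_grad=True)
    h_retain_ref = hidden(frozen_model, retain_batch, track_grad=False)

    target = steer_coeff * control_vec  # random unit vector, drawn once
    forget_loss = F.mse_loss(h_forget, target.expand_as(h_forget))
    retain_loss = F.mse_loss(h_retain, h_retain_ref)
    return forget_loss + alpha * retain_loss
```

In the paper, only a small set of weights near the steered layer is updated against this loss, which is what keeps the edit localized rather than degrading the whole network.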
Load-bearing premise
Performance on WMDP questions reliably indicates real-world ability to assist in developing weapons of mass destruction.
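Concretely, "performance" here is standard multiple-choice accuracy: the model is credited when it assigns the highest likelihood to the correct option. A minimal scoring sketch, assuming a HuggingFace causal LM; the prompt template is a placeholder and the paper's exact format may differ.

```python
import torch

def mcq_predict(model, tok, question, choices):
    """Return the index of the option the model scores as most likely.

    Each option is scored by the summed log-probability of its text as a
    continuation of the question prompt. Assumes the prompt prefix
    tokenizes identically with and without the choice appended (a common
    simplification).
    """
    prompt = f"{question}\nAnswer: "
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    scores = []
    for choice in choices:
        ids = tok(prompt + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # Row i of the shifted log-probs predicts input token i + 1.
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        answer_ids = ids[0, prompt_len:]
        answer_lp = logprobs[prompt_len - 1:].gather(1, answer_ids.unsqueeze(1))
        scores.append(answer_lp.sum().item())
    return max(range(len(choices)), key=scores.__getitem__)
```

Accuracy over the 3,668 items is then the fraction of questions whose predicted index matches the gold answer; the premise is that this number tracks real-world capability.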
What would settle it
A model that scores near random on WMDP but still supplies accurate step-by-step guidance for producing a biological weapon would show the benchmark fails to capture actual hazardous capability.
Original abstract
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai
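For readers who want to run the numbers themselves, the dataset is public; a minimal loading sketch, assuming the Hugging Face hub path `cais/wmdp` and these field names (both are assumptions here; wmdp.ai is the canonical source named in the abstract).

```python
from datasets import load_dataset

# Hub path, config name, and field names are assumptions; wmdp.ai is
# the canonical distribution point named in the paper.
wmdp_bio = load_dataset("cais/wmdp", "wmdp-bio", split="test")
ex = wmdp_bio[0]
print(ex["question"])
for letter, choice in zip("ABCD", ex["choices"]):
    print(f"  {letter}. {choice}")
print("gold:", "ABCD"[ex["answer"]])
```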
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the publicly released WMDP benchmark, a dataset of 3,668 multiple-choice questions spanning biosecurity, cybersecurity, and chemical security, developed as a proxy for hazardous knowledge in LLMs. It proposes RMU, a representation-control unlearning method, and reports that RMU lowers WMDP accuracy while preserving general capabilities in biology and computer science, with the benchmark and code released at https://wmdp.ai.
Significance. If the empirical claims hold, the work is significant because it supplies the first open benchmark for hazardous capabilities, directly addressing the limitation that existing evaluations are private. The public release of both the dataset and RMU code enables reproducible research on unlearning, and the suggestion that targeted representation control can reduce proxy risk without broad capability loss provides a concrete, testable direction for AI safety.
major comments (2)
- [§3] Benchmark Construction and Filtering: The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that accuracy on the retained questions still tracks misuse-relevant capability.
- [§4] RMU Experiments: The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.
minor comments (1)
- Add explicit section numbers and equation references throughout for easier navigation; several cross-references in the methods appear to rely on implicit numbering.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment point by point below, indicating where revisions will be incorporated.
Point-by-point responses
Referee: [§3] Benchmark Construction and Filtering: The stringently filtered MCQ items remove sensitive details by design, yet the central claim that WMDP serves as a proxy for real-world malicious use rests on the untested assumption that reduced MCQ accuracy implies reduced ability to synthesize or apply hazardous knowledge in open-ended settings. No ablation or external validation (e.g., expert red-teaming on retained items) is provided to show that accuracy on the retained questions still tracks misuse-relevant capability.
Authors: We agree that WMDP functions as a proxy and that direct validation linking MCQ accuracy to open-ended synthesis capabilities would strengthen the work. However, performing such validations (e.g., expert red-teaming on retained items) would require testing actual hazardous knowledge application, which raises insurmountable ethical and safety barriers. The questions were developed and filtered in consultation with domain experts precisely to retain proxy relevance while eliminating actionable details. The public release at wmdp.ai is intended to enable safe, community-driven follow-up studies. We will add an expanded limitations subsection discussing the proxy assumption and outlining directions for future validation. revision: partial
Referee: [§4] RMU Experiments: The abstract and results claim that RMU reduces WMDP performance while maintaining general biology/CS capabilities, but no quantitative tables, error bars, baseline comparisons (e.g., gradient ascent, fine-tuning), or statistical tests are referenced in the provided summary; without these, the preservation claim cannot be evaluated for robustness or effect size.
Authors: The full manuscript (Section 4 and associated tables) already contains the requested details: accuracy tables showing WMDP drops (with standard deviations across 3–5 seeds), direct comparisons to gradient ascent and fine-tuning baselines, MMLU biology/CS subset scores demonstrating capability preservation, and statistical tests (paired t-tests) confirming significance. These results are referenced in the results narrative. We will revise the abstract and introduction summary to explicitly cite the relevant tables and effect sizes so the quantitative support is immediately visible. revision: yes
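For readers checking the rebuttal's claim, the described significance test is straightforward to reproduce; a minimal sketch with made-up placeholder accuracies (the real per-seed numbers live in the paper's Section 4 tables).

```python
import numpy as np
from scipy import stats

# Placeholder per-seed accuracies, NOT the paper's results: one
# (base, unlearned) pair per random seed on the same question set.
base_acc = np.array([0.64, 0.63, 0.65, 0.64, 0.63])
rmu_acc = np.array([0.30, 0.29, 0.31, 0.30, 0.28])

# Paired t-test across seeds, as described in the rebuttal.
t_stat, p_value = stats.ttest_rel(base_acc, rmu_acc)
print(f"mean drop = {(base_acc - rmu_acc).mean():.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.1e}")
```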
Circularity Check
No significant circularity: empirical benchmark release and experimental results
full rationale
The paper releases a public MCQ benchmark (WMDP) and evaluates an unlearning method (RMU) via direct performance measurements on held-out test sets and general capability benchmarks. No mathematical derivation chain exists; results are reported from experiments rather than fitted parameters or self-referential definitions. Central claims rest on observed accuracy drops on WMDP items and maintained scores on MMLU-style biology/CS questions, which are independently verifiable and not reduced to the inputs by construction. Self-citations, if present, are not load-bearing for the empirical findings.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: WMDP questions serve as a valid proxy for hazardous knowledge without leaking sensitive information after stringent filtering.
Forward citations
Cited by 19 Pith papers
- Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models. Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
- Inducing Artificial Uncertainty in Language Models. Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
- Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models. CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
- Jailbroken Frontier Models Retain Their Capabilities. Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
- Is your algorithm unlearning or untraining? Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).
- Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter. Targeting minor components in LLM representations during unlearning yields substantially better resistance to relearning attacks than prior methods.
- Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies. A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.
- CAP: Controllable Alignment Prompting for Unlearning in LLMs. CAP optimizes prompts via reinforcement learning to selectively unlearn target knowledge in LLMs while preserving general capabilities, without any parameter updates and with reversible revocation.
- CAP: Controllable Alignment Prompting for Unlearning in LLMs. CAP enables reversible unlearning of targeted knowledge in LLMs through optimized prompts generated via reinforcement learning, without any parameter updates.
- Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks. Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
- WIN-U: Woodbury-Informed Newton-Unlearning as a retain-free Machine Unlearning Framework. WIN-U delivers a retain-free unlearning update that approximates the gold-standard retrained model via a Woodbury-informed Newton step using only forget-set curvature information.
- Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs. LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
- Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts. Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
- Efficient machine unlearning with minimax optimality. ULS provides minimax-optimal estimation of remaining-data parameters in machine unlearning with limited access and decomposes error into oracle plus unlearning cost terms.
- Chain-of-Authorization: Embedding authorization into large language models. LLMs fine-tuned to output authorization trajectories as a prerequisite for responses achieve high rejection rates for unauthorized prompts while preserving utility in allowed scenarios.
- SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond. SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
- Do Linear Probes Generalize Better in Persona Coordinates? Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
- Humanity's Last Exam. Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
- Risk Reporting for Developers' Internal AI Model Use. A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
discussion (0)