SoSBench: Benchmarking Safety Alignment on Six Scientific Domains

Bhaskar Ramasubramanian; Bo Li; Fengbo Ma; Fengqing Jiang; Luyao Niu; Radha Poovendran; Xianyan Chen; Yuetai Li; Zhangchen Xu; Zhen Xiang

arXiv: 2505.21605 · v3 · submitted 2025-05-27 · 💻 cs.LG · cs.AI · cs.CR

Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Zixin Rao, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran

Pith reviewed 2026-05-19 12:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CR

keywords safety alignment · large language models · benchmark · scientific domains · harmful responses · misuse scenarios · regulation-grounded · refusal failures

The pith

Advanced language models still disclose detailed harmful scientific information in up to 85 percent of regulation-based queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SoSBench to test whether large language models refuse requests for hazardous details in six scientific fields when those requests are grounded in actual laws and expanded into realistic misuse cases. It generates 3,000 prompts covering chemistry, biology, medicine, pharmacology, physics, and psychology, then runs frontier models through them in a single evaluation setup. Results show consistent failures: Deepseek-R1 responds harmfully 84.9 percent of the time and GPT-4.1 50.3 percent of the time. A sympathetic reader would care because this points to gaps in how current safety training handles knowledge-rich risks that could enable real-world damage.

Core claim

SoSBench is a benchmark of 3000 regulation-derived prompts across six high-risk scientific domains that were expanded through an LLM-assisted evolutionary process to create detailed misuse scenarios, and evaluations within a unified framework show that advanced models disclose policy-violating content at high rates, including 84.9 percent for Deepseek-R1 and 50.3 percent for GPT-4.1, revealing deficiencies in safety alignment for sophisticated scientific queries.
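The headline figures are proportions of judged responses, so it helps to see how such a rate and its sampling uncertainty might be computed. The following is a minimal sketch, not the paper's code: the record layout `(model, domain, is_harmful)` and the function names `harmful_rates` and `wilson_interval` are assumptions made for illustration, and the Wilson score interval is a standard choice that the paper is not known to use.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k harmful responses out of n (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def harmful_rates(records):
    """Group hypothetical (model, domain, is_harmful) records.

    Returns {(model, domain): (harmful_rate, number_of_prompts)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [harmful, total]
    for model, domain, is_harmful in records:
        counts[(model, domain)][0] += int(is_harmful)
        counts[(model, domain)][1] += 1
    return {key: (h / n, n) for key, (h, n) in counts.items()}
```

If a model had been scored on all 3,000 prompts (an assumption; the abstract does not state the per-model denominator), an 84.9 percent rate would correspond to roughly 2,547 harmful responses, which could be passed to `wilson_interval(2547, 3000)` to gauge how tight the estimate is.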

What carries the argument

The LLM-assisted evolutionary pipeline that starts from real regulatory text and generates diverse, detailed misuse scenarios for each of the six domains.

If this is right

Current alignment methods must be extended to cover detailed, knowledge-intensive scientific content rather than only simple harmful instructions.
Models intended for scientific use will require additional domain-specific refusal layers to limit disclosure of regulated synthesis or application details.
Deployment decisions for powerful LLMs should include testing against regulation-grounded scenarios before granting broad access.
New safety techniques could be developed by using the benchmark's evolutionary method to generate training examples that force better refusal on complex queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the benchmark holds, real-world users seeking harmful scientific assistance may succeed more often than simpler tests have suggested.
Safety teams could adapt the prompt-expansion approach to create targeted training data that improves refusal without hurting helpfulness on legitimate queries.
The results raise the possibility that hybrid systems combining LLMs with external verification tools might be needed for any scientific application involving regulated knowledge.

Load-bearing premise

The generated prompts accurately capture realistic high-risk misuse situations from actual regulations without introducing phrasing or details that systematically increase refusal failures.

What would settle it

Run the same models on a fresh set of prompts that domain experts have independently confirmed as realistic and directly tied to enforceable regulations. If the original benchmark overstates the risk, harmful-response rates on that set would be substantially lower.
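The comparison proposed here amounts to testing whether two harmful-response rates differ by more than sampling noise. A minimal sketch under stated assumptions: the function name `two_proportion_z` is illustrative, the counts would come from the original benchmark and from a hypothetical expert-validated set, and the pooled two-proportion z-test is one standard option rather than anything the paper prescribes.

```python
from math import erfc, sqrt


def two_proportion_z(k1, n1, k2, n2):
    """Pooled two-proportion z-test.

    k1/n1: harmful count and prompt count on the original benchmark.
    k2/n2: harmful count and prompt count on the comparison set.
    Returns (z, two_sided_p_value) for H0: both underlying rates are equal.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value
```

A large positive `z` with a small p-value would indicate that the original prompts elicit harmful responses more often than the validated set, which is the pattern expected if the benchmark overstates the risk.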

Original abstract

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SoSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SoSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 84.9% for Deepseek-R1 and 50.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SoSBench supplies a regulation-grounded test set for knowledge-heavy scientific misuse and reports high refusal failures on frontier models, but the prompt-generation step needs closer scrutiny before the numbers can be taken at face value.

The main point is that this paper builds a benchmark of 3,000 prompts drawn from real regulations across chemistry, biology, medicine, pharmacology, physics, and psychology, then expands them with an LLM evolutionary pipeline and measures how often current models still give policy-violating answers. The headline numbers (50.3% harmful responses for GPT-4.1 and 84.9% for Deepseek-R1) stand out because they target detailed, technical scenarios rather than the usual low-knowledge jailbreaks. That focus is the clearest addition to existing safety work. The evaluation covers multiple named frontier models under one setup, which makes direct comparison easier than scattered prior tests. The authors also release the benchmark, which is useful on its own.

The central claim holds up as far as the abstract goes: models that claim strong alignment still leak content in these domains at rates that matter for deployment decisions. The soft spot is the evolutionary pipeline itself. The stress-test note is right to flag that no human validation or control comparison to raw regulatory excerpts is described in the provided text. If the LLM step routinely adds over-specified formulas or implausible combinations, the refusal failures could partly reflect prompt artifacts instead of genuine gaps in training. The abstract also leaves the exact judge criteria and any post-generation filtering steps unspecified, so the reported percentages rest on an evaluation protocol that still needs to be checked in the full manuscript. These issues are real but look like gaps that more documentation or an appendix could close rather than load-bearing flaws in the overall design.

The work is aimed at people who build or audit safety evaluations for models that might be asked technical questions in regulated fields. Anyone running red-teaming or deciding on release thresholds for scientific capabilities will get concrete data points from the benchmark even if they later adjust the scoring.

It is worth sending to referees because the domain coverage and grounding in regulations address a documented gap, and the empirical results are sharp enough to repay careful review even if revisions are needed on the validation details.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SoSBench, a benchmark of 3,000 prompts spanning six scientific domains (chemistry, biology, medicine, pharmacology, physics, psychology). Prompts are derived from real-world regulations and expanded via an LLM-assisted evolutionary pipeline to generate detailed misuse scenarios. Frontier LLMs are evaluated in a unified framework, with results showing high rates of harmful responses (e.g., 84.9% for Deepseek-R1 and 50.3% for GPT-4.1), which the authors interpret as evidence of significant safety alignment deficiencies in knowledge-intensive hazardous contexts.

Significance. If the prompts accurately instantiate realistic high-risk misuse scenarios grounded in regulations, the benchmark would fill a clear gap left by existing safety evaluations that focus on low-knowledge or low-risk tasks. The regulation-grounded construction and multi-domain scope represent a concrete strength, and the direct empirical measurement on external models provides falsifiable data that could guide future alignment work.

major comments (2)

[Benchmark construction / prompt generation pipeline] The abstract and benchmark construction description state that the LLM-assisted evolutionary pipeline produces 'diverse, realistic misuse scenarios' (e.g., detailed explosive synthesis instructions involving advanced chemical formulas), yet no validation steps—such as human expert review, inter-rater reliability metrics, or control comparisons against non-evolved regulatory excerpts—are reported. This is load-bearing for the central claim, because the headline failure rates (84.9% and 50.3%) could reflect prompt artifacts rather than genuine alignment gaps if the pipeline systematically introduces unrealistic phrasing or over-specified details.
[Evaluation framework] The evaluation framework reports specific harmful-response percentages but does not detail the judge criteria used to classify responses as harmful or any post-generation filtering steps. This detail is required to assess whether the numbers fully support the claim of alignment deficiencies across domains.
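The first major comment asks for inter-rater reliability metrics. One common such metric is Cohen's kappa, which corrects raw agreement between two raters for agreement expected by chance. The sketch below is an illustration only: the function name `cohens_kappa` and the example labels are hypothetical, and the manuscript does not say which reliability metric, if any, the authors would adopt.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    labels_a, labels_b: equal-length lists of category labels, for example
    "realistic" / "unrealistic" verdicts on a sample of prompts.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates that the raters agree no more often than their label frequencies alone would predict.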

minor comments (1)

[Abstract] The abstract could briefly note the total number of models evaluated and the distribution of prompts across the six domains to give readers an immediate sense of scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting the strengths of the regulation-grounded approach in SoSBench. We address each of the major comments in turn and outline the revisions we plan to make.

read point-by-point responses

Referee: [Benchmark construction / prompt generation pipeline] The abstract and benchmark construction description state that the LLM-assisted evolutionary pipeline produces 'diverse, realistic misuse scenarios' (e.g., detailed explosive synthesis instructions involving advanced chemical formulas), yet no validation steps—such as human expert review, inter-rater reliability metrics, or control comparisons against non-evolved regulatory excerpts—are reported. This is load-bearing for the central claim, because the headline failure rates (84.9% and 50.3%) could reflect prompt artifacts rather than genuine alignment gaps if the pipeline systematically introduces unrealistic phrasing or over-specified details.

Authors: We recognize that the current description of the prompt generation pipeline lacks explicit validation procedures such as human expert review or inter-rater reliability assessments. The pipeline is designed to start from regulatory texts and evolve them into detailed scenarios using LLMs, with the intent of preserving realism and regulatory grounding. To strengthen this aspect, we will revise the benchmark construction section to provide more details on the evolutionary pipeline, including the base prompts and evolution parameters. Additionally, we will include a new analysis reporting expert validation on a subset of the prompts to confirm their realism and relevance to high-risk misuse scenarios.

Revision: yes

Referee: [Evaluation framework] The evaluation framework reports specific harmful-response percentages but does not detail the judge criteria used to classify responses as harmful or any post-generation filtering steps. This detail is required to assess whether the numbers fully support the claim of alignment deficiencies across domains.

Authors: We agree that more transparency is needed regarding the evaluation framework. The harmful response rates are determined using an LLM judge with specific criteria focused on the provision of detailed, actionable advice that contravenes the referenced regulations. We will update the manuscript to include the complete judge prompt, the decision criteria with illustrative examples, and any filtering applied to the model outputs prior to judgment.

Revision: yes
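Why the filtering step matters can be shown with a small calculation: the same judged outputs yield different harmful-response rates depending on whether filtered outputs leave the denominator or are counted as safe. The sketch below is hypothetical throughout. The label names, the function `harmful_rate`, and the two counting policies are assumptions for illustration, since the manuscript does not describe its filtering.

```python
def harmful_rate(judged, count_filtered_as="exclude"):
    """Harmful-response rate under two hypothetical filtering policies.

    judged: list of labels, each "harmful", "safe", or "filtered".
    count_filtered_as: "exclude" drops filtered outputs from the denominator;
                       "safe" keeps them in the denominator as non-harmful.
    """
    harmful = judged.count("harmful")
    filtered = judged.count("filtered")
    total = len(judged)
    if count_filtered_as == "exclude":
        denom = total - filtered
    elif count_filtered_as == "safe":
        denom = total
    else:
        raise ValueError("count_filtered_as must be 'exclude' or 'safe'")
    if denom == 0:
        raise ValueError("no responses left to score")
    return harmful / denom


# Illustrative labels only, not data from the paper.
labels = ["harmful"] * 5 + ["safe"] * 3 + ["filtered"] * 2
rate_excluding = harmful_rate(labels, "exclude")  # 5 / 8
rate_as_safe = harmful_rate(labels, "safe")       # 5 / 10
```

With these made-up labels the rate moves from 0.5 to 0.625 purely through the choice of denominator, which is why the referee's request for the filtering details bears directly on the reported percentages.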

Circularity Check

0 steps flagged

No significant circularity; empirical measurement on external models

full rationale

The paper constructs SoSBench by starting from real-world regulations, applying an LLM-assisted evolutionary pipeline to generate 3,000 prompts, and then measuring harmful-response rates on frontier models. The central results (e.g., 84.9% harmful responses for Deepseek-R1) are direct empirical observations on independent models and do not reduce to the input regulations or pipeline outputs by any equation, fitted parameter, or definitional equivalence. No self-citations, uniqueness theorems, or ansatzes are invoked to force the reported rates. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the assumption that the generated prompts constitute valid tests of safety alignment for real-world misuse; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: Prompts derived from real regulations and expanded by LLM evolution represent realistic high-risk misuse scenarios.
This premise is required to interpret high refusal-failure rates as evidence of safety deficiencies rather than artifacts of unrealistic test items.

pith-pipeline@v0.9.0 · 5821 in / 1266 out tokens · 51418 ms · 2026-05-19T12:40:15.308153+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
cs.CL · 2026-05 · unverdicted · novelty 6.0

MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.

SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
cs.LG · 2026-03 · conditional · novelty 6.0

SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.