LLMs Encode Harmfulness and Refusal Separately, December 2025
6 Pith papers cite this work. Polarity classification is still indexing.
[Citation timeline: 6 representative citing papers, all from 2026, all currently unverdicted.]
Representative citing papers
-
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying capability.
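A minimal sketch of what head-level scaling can look like mechanically, assuming a GPT-2-style architecture. The layer, head, and gain below are hypothetical placeholders, not the circuit the paper localizes, and GPT-2 itself has no alignment training; this only illustrates the intervention primitive.

```python
# Hypothetical sketch: scale one attention head's contribution via a pre-hook
# on the attention out-projection, whose input is the concatenated head outputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

LAYER, HEAD, GAIN = 8, 5, 0.0  # placeholder indices; GAIN=0 silences, >1 amplifies
d_head = model.config.n_embd // model.config.n_head

def scale_head(module, args):
    hidden = args[0].clone()
    hidden[..., HEAD * d_head:(HEAD + 1) * d_head] *= GAIN
    return (hidden,) + args[1:]

handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(scale_head)
ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0]))
```

The same pattern, applied to heads a causal analysis actually implicates, is how a gain knob on refusal behavior could be exposed without editing weights.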
-
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model over prefill trajectories recovers many of those misses without the false positives of naive pooling.
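A hedged sketch of the trajectory-scoring idea, with synthetic arrays standing in for per-token residual-stream states; the paper's actual featurization, state count, and decision rule are not reproduced here.

```python
# Sketch: PCA-reduce per-token hidden states, fit an HMM on benign prefill
# trajectories, then flag sequences whose log-likelihood is anomalously low.
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# stand-in for 20 benign prompts x 50 tokens x 768-dim hidden states
benign = [rng.normal(size=(50, 768)) for _ in range(20)]

pca = PCA(n_components=10).fit(np.vstack(benign))
reduced = [pca.transform(t) for t in benign]

hmm = GaussianHMM(n_components=3, covariance_type="diag", random_state=0)
hmm.fit(np.vstack(reduced), lengths=[len(t) for t in reduced])

# a shifted trajectory scores lower than the benign ones the model was fit on
test = pca.transform(rng.normal(loc=0.5, size=(50, 768)))
print("per-token log-likelihood:", hmm.score(test) / 50)
```

Because the HMM scores the whole trajectory, evidence spread across early tokens still moves the likelihood, which is exactly what a final-token probe cannot see.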
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.
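Linear separability of this kind is cheap to test; a minimal probe sketch with synthetic activations follows (a real run would extract residual-stream states from a hooked model, and the 0.982 figure is the paper's, not this toy's).

```python
# Sketch: logistic-regression probe on activation vectors, scored with AUROC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
harm_dir = rng.normal(size=768)  # stand-in "harmful intent" direction
X = np.vstack([rng.normal(size=(500, 768)),                    # benign prompts
               rng.normal(size=(500, 768)) + 0.3 * harm_dir])  # harmful prompts
y = np.array([0] * 500 + [1] * 500)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("AUROC:", roc_auc_score(yte, probe.predict_proba(Xte)[:, 1]))
```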
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models while using 250x fewer parameters.
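A rough two-stage sketch in that spirit: per-layer linear probes whose scores are combined by a second learned weighting. The synthetic setup below is an illustrative assumption, not SIREN's actual pipeline.

```python
# Sketch: per-layer probes -> learned combination weights over layer scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_layers, d = 12, 256
y = np.array([0] * 400 + [1] * 400)
# synthetic per-layer activations; later layers carry more signal by construction
acts = [rng.normal(size=(800, d))
        + (layer / n_layers) * 0.4 * np.outer(y, rng.normal(size=d))
        for layer in range(n_layers)]

probes = [LogisticRegression(max_iter=1000).fit(a, y) for a in acts]
scores = np.column_stack([p.predict_proba(a)[:, 1] for p, a in zip(probes, acts)])

combiner = LogisticRegression(max_iter=1000).fit(scores, y)  # adaptive weighting
print("learned layer weights:", np.round(combiner.coef_[0], 2))
```

(For a sketch we fit and inspect on the same data; a real evaluation would hold out prompts.)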
-
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
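One way to make "mechanistic divergence" concrete is to compare the mean activation shifts that different jailbreak framings induce; a toy sketch with synthetic states follows (real data would come from hooked forward passes, and the framing names are placeholders).

```python
# Toy sketch: cosine similarity between mean activation shifts per jailbreak
# framing. Low off-diagonal similarity suggests mechanistically distinct paths.
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 768))  # states on plain harmful prompts
jb_states = {name: base + rng.normal(size=768)  # one synthetic shift per framing
             for name in ["roleplay", "encoding", "many-shot"]}

deltas = {n: (s - base).mean(axis=0) for n, s in jb_states.items()}
names = list(deltas)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        u, v = deltas[a], deltas[b]
        print(f"{a} vs {b}: cosine "
              f"{u @ v / (np.linalg.norm(u) * np.linalg.norm(v)):+.2f}")
```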
-
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.