CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Aaron Mueller; Dana Arad; Martin Tutek; Tomer Ashuach; Yonatan Belinkov

arxiv: 2508.13650 · v3 · submitted 2025-08-19 · 💻 cs.CL

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, Yonatan Belinkov

Pith reviewed 2026-05-18 22:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords: concept unlearning, sparse autoencoders, large language models, persistent unlearning, WMDP benchmark, feature suppression, AI safety

The pith

CRISP identifies and suppresses salient SAE features across layers to achieve persistent unlearning of harmful knowledge in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRISP, a parameter-efficient approach that uses sparse autoencoders to locate features tied to unwanted concepts in language models and then suppresses those features. Unlike earlier techniques limited to inference-time changes, this method alters the model parameters themselves so the unlearning endures even if an adversary gains full access to the weights. Experiments on two LLMs and the WMDP benchmark demonstrate stronger removal of harmful knowledge than prior methods, with little loss to general capabilities or performance on related tasks. Feature-level inspection shows the suppressed elements form a semantically distinct group separate from benign concepts.

Core claim

CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations, successfully removing harmful knowledge while preserving general and in-domain capabilities on WMDP tasks.

What carries the argument

Automatic identification and suppression of salient SAE features across multiple layers to produce persistent parameter changes.
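
As a rough illustration of what "identifying salient SAE features" could look like, the sketch below ranks features by how much more strongly they fire on target-concept text than on benign text. The function name, the ratio threshold, and the top-k cutoff are all hypothetical choices made for this example; the paper's actual scoring rule is not given in this summary and may differ.

```python
from statistics import mean


def select_salient_features(target_acts, benign_acts, top_k=2, min_ratio=2.0, eps=1e-6):
    """Rank SAE features by how much more they fire on target than benign text.

    target_acts / benign_acts: lists of per-token activation vectors, each a
    list of non-negative floats with one entry per SAE feature.
    Scoring rule and thresholds are assumptions for illustration only.
    """
    n_features = len(target_acts[0])
    scored = []
    for i in range(n_features):
        t = mean(row[i] for row in target_acts)
        b = mean(row[i] for row in benign_acts)
        # Keep a feature only if it is clearly more active on the target corpus.
        if t / (b + eps) >= min_ratio:
            scored.append((t - b, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]


# Toy activations: 2 tokens per corpus, 4 SAE features.
target = [[0.9, 0.1, 0.8, 0.0], [1.1, 0.2, 0.7, 0.0]]
benign = [[0.1, 0.1, 0.7, 0.3], [0.0, 0.2, 0.9, 0.1]]
salient = select_salient_features(target, benign)
print(salient)  # [0]
```

Feature 2 is highly active on the target text but equally active on benign text, so the ratio test excludes it. That is the kind of shared feature a selective method would want to leave alone.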

If this is right

Unlearning persists against actors who can access and modify model parameters.
General capabilities and in-domain performance remain largely intact after the intervention.
Target concepts separate cleanly from benign ones at the feature level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feature-suppression process could be tested on other forms of unwanted content such as biases or specific memorized facts.
Repeated applications on evolving models might allow ongoing management of emerging harmful concepts.
Combining the approach with monitoring for feature reactivation could create more adaptive safety layers.

Load-bearing premise

That the SAE features flagged as salient for a target concept are sufficiently monosemantic and causally responsible for the unwanted knowledge.

What would settle it

A model that still produces the target harmful outputs after feature suppression or that shows clear drops in general or in-domain task performance.

Original abstract

As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
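
The abstract's distinction between inference-time and persistent interventions can be made concrete with a toy example. The sketch below is not CRISP's update rule: it swaps in plain direction ablation on a hypothetical 2x2 layer, with a made-up concept direction `d`, only to show why a runtime hook disappears when removed while a change folded into the weights does not.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(v, d):
    """Remove the component of v along the unit direction d."""
    c = dot(v, d)
    return [vi - c * di for vi, di in zip(v, d)]


class ToyLayer:
    """A linear layer with an optional runtime hook on its output."""

    def __init__(self, W):
        self.W = [row[:] for row in W]
        self.hook = None

    def forward(self, x):
        h = [dot(row, x) for row in self.W]
        return self.hook(h) if self.hook else h


d = [1.0, 0.0]                # hypothetical unit "concept" direction
x = [2.0, 3.0]                # toy input
W = [[1.0, 1.0], [0.0, 1.0]]  # toy weights

# Inference-time intervention: a hook that anyone with weight access can drop.
layer = ToyLayer(W)
layer.hook = lambda h: project_out(h, d)
hooked = layer.forward(x)     # [0.0, 3.0]
layer.hook = None
restored = layer.forward(x)   # [5.0, 3.0], the concept component is back

# Weight-level intervention: fold the projection into W, i.e. W' = (I - d d^T) W,
# by projecting each column of W.
cols = [project_out([row[j] for row in W], d) for j in range(len(W[0]))]
W_edit = [[cols[j][i] for j in range(len(cols))] for i in range(len(W))]
persist = ToyLayer(W_edit).forward(x)  # [0.0, 3.0] with no hook installed
```

The toy says nothing about robustness to further fine-tuning, which is a separate question raised in the review below.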

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CRISP tries to turn SAE feature suppression into a lasting parameter change rather than an inference-time patch, but the abstract gives too little on controls and baselines to judge if it actually sticks.

Hi, the core move here is taking SAE-identified features for unwanted concepts, then suppressing them in a way that alters the model weights instead of just masking activations at runtime. That addresses the reversibility issue with prior inference-only edits, and the multi-layer selection plus coherence checks are a reasonable extension of existing SAE work. The experiments on two LLMs and reported gains on WMDP tasks for removing harmful knowledge while keeping general capabilities are the parts that could matter for safety applications. They also show some separation between target and benign features, which helps the selectivity story.

The main gaps are in the evaluation. The abstract claims outperformance without numbers, statistical tests, or clear descriptions of how capability preservation was measured or whether further fine-tuning could recover the suppressed knowledge. If the features picked up are correlated rather than causal, or if the knowledge sits in redundant directions, the suppression may not deliver the promised persistence. That concern from the stress-test note still looks live until the full results are examined.

This is for people already working on mechanistic editing and unlearning benchmarks. A reader focused on practical deployment fixes might pick up the pipeline idea, but anyone wanting to build on it would need the detailed tables and ablations first. I would send it to peer review so the experimental claims get proper scrutiny rather than desk-rejecting on the abstract alone.

Referee Report

2 major / 2 minor

Summary. The paper introduces CRISP, a parameter-efficient method for persistent concept unlearning in LLMs that uses sparse autoencoders to automatically identify salient features across multiple layers and suppress their activations. Experiments on two LLMs demonstrate outperformance over prior methods on safety-critical tasks from the WMDP benchmark, with successful removal of harmful knowledge while preserving general and in-domain capabilities; feature-level analysis is presented as evidence of semantically coherent separation between target and benign concepts.

Significance. If the central empirical claims hold under rigorous controls, the work would be significant for AI safety, as it targets the reversibility limitation of inference-time SAE interventions by producing parameter-level changes. The multi-layer feature identification and suppression approach, combined with reported capability preservation on WMDP, offers a practical direction for selective unlearning. The empirical procedure and feature coherence analysis constitute strengths that could be built upon if quantitative details and causal evidence are strengthened.

major comments (2)

[Abstract] Abstract: the claim of outperformance on WMDP tasks and successful removal of harmful knowledge while preserving capabilities provides no quantitative details on baselines, effect sizes, statistical significance, or controls, which is load-bearing for assessing whether the central empirical claim holds.
[Evaluation] Evaluation / Feature analysis: the assumption that automatically identified salient SAE features are the primary causal carriers of target knowledge (rather than correlated or redundant directions) is not tested against alternative pathways or non-SAE representations; this directly threatens both the selectivity and persistence claims, especially given the paper's own note that prior inference-time interventions are reversible.

minor comments (2)

[Methods] Clarify the precise algorithm and hyperparameters for identifying 'salient' features across layers and the exact form of activation suppression (e.g., whether it is a permanent parameter edit or a training-time penalty).
[Experiments] Add explicit discussion of how capability preservation is measured on both general and in-domain WMDP tasks, including any ablation on the number of suppressed features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and insightful comments on our paper. We appreciate the emphasis on providing more quantitative details in the abstract and strengthening the causal evidence for our feature identification approach. We have prepared revisions to address these points and respond to each major comment below.

Referee: [Abstract] Abstract: the claim of outperformance on WMDP tasks and successful removal of harmful knowledge while preserving capabilities provides no quantitative details on baselines, effect sizes, statistical significance, or controls, which is load-bearing for assessing whether the central empirical claim holds.

Authors: We agree that the abstract lacks specific quantitative details, which are important for evaluating the claims. In the revised manuscript, we will incorporate key metrics from our experiments, including the exact performance numbers on WMDP benchmarks compared to baselines, effect sizes, and information on statistical significance and experimental controls. This revision will make the abstract more informative and directly support the central empirical claims.
Revision: yes

Referee: [Evaluation] Evaluation / Feature analysis: the assumption that automatically identified salient SAE features are the primary causal carriers of target knowledge (rather than correlated or redundant directions) is not tested against alternative pathways or non-SAE representations; this directly threatens both the selectivity and persistence claims, especially given the paper's own note that prior inference-time interventions are reversible.

Authors: We acknowledge the referee's point that our work assumes the identified SAE features are the primary causal carriers without explicit tests against alternatives. While the feature analysis in the paper demonstrates coherent separation, and the parameter-efficient updates provide persistence, we agree that additional validation would strengthen the selectivity claims. In the revision, we will include new ablations testing suppression on alternative directions (such as non-SAE or correlated features) to provide causal evidence and address concerns about reversibility and redundancy.
Revision: yes

Circularity Check

0 steps flagged

Empirical procedure with no derivation reducing to inputs by construction

The paper describes CRISP as a practical, parameter-efficient method that identifies salient SAE features via activation analysis on target prompts and suppresses them through editing or penalties, then evaluates outcomes on WMDP benchmarks and capability tests. No equations or claims present a first-principles prediction, fitted parameter renamed as output, or self-citation chain that forces the central result. The success metrics (harmful knowledge removal while preserving utility) are external to the identification/suppression steps themselves, and the method is presented as an empirical intervention rather than a closed derivation. This matches the reader's assessment that the work is not circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review limits visibility into exact modeling choices; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5698 in / 1061 out tokens · 33330 ms · 2026-05-18T22:43:49.203356+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

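CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations... $\mathcal{L}_{\text{unlearn}} = \mathbb{E}_{t \sim D_{\text{target}}}\left[\mathbb{E}_{f_i \sim F_{\text{salient}}}\left[a_i^{(t)} + \lambda c_t\right]\right]$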

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

We propose an automated pipeline for identifying SAE features salient for a target concept via contrastive activation analysis.
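
The unlearning objective quoted in the first entry above averages the activations of the salient features over tokens drawn from the target data. The sketch below computes only that activation term on toy numbers. The λ term is left out because the excerpt does not define it, and the function and variable names are hypothetical.

```python
def unlearn_loss(activations, salient):
    """Mean activation of the salient features, averaged over target tokens.

    activations: list of per-token SAE activation vectors (lists of floats).
    salient: indices of the features selected for suppression.
    Covers only the activation term of the quoted objective (assumption:
    the expectation over features is a plain mean).
    """
    per_token = [sum(tok[i] for i in salient) / len(salient) for tok in activations]
    return sum(per_token) / len(per_token)


# Two target tokens, three SAE features; features 0 and 2 are marked salient.
acts = [[0.5, 0.0, 1.5], [1.5, 2.0, 0.5]]
loss = unlearn_loss(acts, salient=[0, 2])
print(loss)  # 1.0
```

Minimizing this quantity pushes the selected features toward zero on target text. Feature 1 does not enter the loss, so its activations are not directly penalized.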

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interpretability Can Be Actionable
cs.LG · 2026-05 · conditional · novelty 6.0

Interpretability research should be judged by actionability (the degree to which its insights support concrete decisions and interventions) rather than explanatory power alone.

Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate
cs.AI · 2026-04 · unverdicted · novelty 6.0

Two-stage fine-tuning distills multi-agent debate into single LLMs, matching performance at 93% lower token cost while revealing agent-specific activation subspaces for steering.