CRISP: Persistent Concept Unlearning via Sparse Autoencoders
Pith reviewed 2026-05-18 22:43 UTC · model grok-4.3
The pith
CRISP identifies and suppresses salient SAE features across layers to achieve persistent unlearning of harmful knowledge in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations, successfully removing harmful knowledge while preserving general and in-domain capabilities on WMDP tasks.
What carries the argument
Automatic identification and suppression of salient SAE features across multiple layers to produce persistent parameter changes.
If this is right
- Unlearning persists against actors who can access and modify model parameters.
- General capabilities and in-domain performance remain largely intact after the intervention.
- Target concepts separate cleanly from benign ones at the feature level.
Where Pith is reading between the lines
- The same feature-suppression process could be tested on other forms of unwanted content such as biases or specific memorized facts.
- Repeated applications on evolving models might allow ongoing management of emerging harmful concepts.
- Combining the approach with monitoring for feature reactivation could create more adaptive safety layers.
Load-bearing premise
That the SAE features flagged as salient for a target concept are sufficiently monosemantic and causally responsible for the unwanted knowledge.
What would settle it
A model that still produces the target harmful outputs after feature suppression or that shows clear drops in general or in-domain task performance.
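The falsification condition above can be phrased as a simple pass/fail check. The sketch below is illustrative only: the function name, the margin and tolerance thresholds, and the example accuracies are assumptions made for this example, not values from the paper. It assumes a four-option multiple-choice forget set, so chance accuracy is 0.25.

```python
def unlearning_holds(forget_acc_after, retain_acc_before, retain_acc_after,
                     chance_acc=0.25, forget_margin=0.05, retain_tolerance=0.02):
    """Return True if the intervention looks successful under assumed thresholds.

    forget_acc_after: accuracy on the target (harmful) questions after unlearning.
    retain_acc_before / retain_acc_after: accuracy on general or in-domain
    benign questions before and after unlearning.
    All thresholds are hypothetical defaults chosen for illustration.
    """
    # Harmful knowledge counts as removed if accuracy is near chance.
    forgotten = forget_acc_after <= chance_acc + forget_margin
    # Capabilities count as preserved if the drop stays within tolerance.
    preserved = (retain_acc_before - retain_acc_after) <= retain_tolerance
    return forgotten and preserved


# Made-up numbers: near-chance forget accuracy, small retain drop.
ok = unlearning_holds(0.27, 0.60, 0.59)
# Made-up numbers: the model still answers the target questions well.
still_harmful = unlearning_holds(0.55, 0.60, 0.59)
# Made-up numbers: forgetting worked but benign accuracy collapsed.
capability_loss = unlearning_holds(0.27, 0.60, 0.50)
print(ok, still_harmful, capability_loss)
```

Either failure mode named in the text (harmful outputs survive, or benign performance drops) makes the check return False.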
Original abstract
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
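The abstract does not say how "salient" features are chosen. As a purely illustrative sketch, and not the paper's algorithm, one could rank SAE features by how much more strongly they activate on target text than on benign text. The function name, the mean-difference score, and the top-k cutoff are all assumptions introduced for this example.

```python
from statistics import mean


def select_salient_features(target_acts, benign_acts, top_k):
    """Rank features by mean activation on target text minus benign text.

    target_acts / benign_acts: lists of per-token activation vectors,
    each a list of floats with one entry per SAE feature.
    Hypothetical scoring rule; the paper's actual criterion may differ.
    """
    n_features = len(target_acts[0])
    scores = []
    for i in range(n_features):
        target_mean = mean(row[i] for row in target_acts)
        benign_mean = mean(row[i] for row in benign_acts)
        scores.append((target_mean - benign_mean, i))
    # Highest contrast first.
    scores.sort(reverse=True)
    # Keep at most top_k features, and only those that fire more on target text.
    return [i for score, i in scores[:top_k] if score > 0]


# Toy activations for three features over two tokens per corpus (made-up data).
target = [[0.0, 2.0, 0.1], [0.0, 3.0, 0.2]]
benign = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.1]]
salient = select_salient_features(target, benign, top_k=2)
print(salient)
```

In this toy data only feature 1 fires clearly more on the target corpus, so it is the only one selected even though `top_k` allows two. A multi-layer variant would repeat the selection per layer.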
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CRISP, a parameter-efficient method for persistent concept unlearning in LLMs that uses sparse autoencoders to automatically identify salient features across multiple layers and suppress their activations. Experiments on two LLMs demonstrate outperformance over prior methods on safety-critical tasks from the WMDP benchmark, with successful removal of harmful knowledge while preserving general and in-domain capabilities; feature-level analysis is presented as evidence of semantically coherent separation between target and benign concepts.
Significance. If the central empirical claims hold under rigorous controls, the work would be significant for AI safety, as it targets the reversibility limitation of inference-time SAE interventions by producing parameter-level changes. The multi-layer feature identification and suppression approach, combined with reported capability preservation on WMDP, offers a practical direction for selective unlearning. The empirical procedure and feature coherence analysis constitute strengths that could be built upon if quantitative details and causal evidence are strengthened.
major comments (2)
- [Abstract] Abstract: the claim of outperformance on WMDP tasks and successful removal of harmful knowledge while preserving capabilities provides no quantitative details on baselines, effect sizes, statistical significance, or controls, which is load-bearing for assessing whether the central empirical claim holds.
- [Evaluation] Evaluation / Feature analysis: the assumption that automatically identified salient SAE features are the primary causal carriers of target knowledge (rather than correlated or redundant directions) is not tested against alternative pathways or non-SAE representations; this directly threatens both the selectivity and persistence claims, especially given the paper's own note that prior inference-time interventions are reversible.
minor comments (2)
- [Methods] Clarify the precise algorithm and hyperparameters for identifying 'salient' features across layers and the exact form of activation suppression (e.g., whether it is a permanent parameter edit or a training-time penalty).
- [Experiments] Add explicit discussion of how capability preservation is measured on both general and in-domain WMDP tasks, including any ablation on the number of suppressed features.
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our paper. We appreciate the emphasis on providing more quantitative details in the abstract and strengthening the causal evidence for our feature identification approach. We have prepared revisions to address these points and respond to each major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: the claim of outperformance on WMDP tasks and successful removal of harmful knowledge while preserving capabilities provides no quantitative details on baselines, effect sizes, statistical significance, or controls, which is load-bearing for assessing whether the central empirical claim holds.
  Authors: We agree that the abstract lacks specific quantitative details, which are important for evaluating the claims. In the revised manuscript, we will incorporate key metrics from our experiments, including the exact performance numbers on WMDP benchmarks compared to baselines, effect sizes, and information on statistical significance and experimental controls. This revision will make the abstract more informative and directly support the central empirical claims.
  Revision: yes
- Referee: [Evaluation] Evaluation / Feature analysis: the assumption that automatically identified salient SAE features are the primary causal carriers of target knowledge (rather than correlated or redundant directions) is not tested against alternative pathways or non-SAE representations; this directly threatens both the selectivity and persistence claims, especially given the paper's own note that prior inference-time interventions are reversible.
  Authors: We acknowledge the referee's point that our work assumes the identified SAE features are the primary causal carriers without explicit tests against alternatives. While the feature analysis in the paper demonstrates coherent separation, and the parameter-efficient updates provide persistence, we agree that additional validation would strengthen the selectivity claims. In the revision, we will include new ablations testing suppression on alternative directions (such as non-SAE or correlated features) to provide causal evidence and address concerns about reversibility and redundancy.
  Revision: yes
Circularity Check
Empirical procedure with no derivation reducing to inputs by construction
Full rationale
The paper describes CRISP as a practical, parameter-efficient method that identifies salient SAE features via activation analysis on target prompts and suppresses them through editing or penalties, then evaluates outcomes on WMDP benchmarks and capability tests. No equations or claims present a first-principles prediction, fitted parameter renamed as output, or self-citation chain that forces the central result. The success metrics (harmful knowledge removal while preserving utility) are external to the identification/suppression steps themselves, and the method is presented as an empirical intervention rather than a closed derivation. This matches the reader's assessment that the work is not circular by construction.
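To make the "suppress them through editing or penalties" step concrete, here is a minimal sketch of a suppression penalty: the mean activation of the salient features over target tokens, which training would push toward zero. It is an assumption-laden illustration, not the paper's loss. The function name is hypothetical, and any extra regularization or retention term the paper uses is deliberately omitted because this page does not define it.

```python
def unlearn_penalty(target_acts, salient):
    """Mean activation of salient features, averaged over target tokens.

    target_acts: list of per-token SAE activation vectors (lists of floats).
    salient: indices of the features flagged for suppression.
    Hypothetical simplification; a real objective would likely add a term
    that protects behavior on benign data.
    """
    if not target_acts or not salient:
        return 0.0
    per_token = [
        sum(row[i] for i in salient) / len(salient)  # average over salient features
        for row in target_acts
    ]
    return sum(per_token) / len(per_token)  # average over target tokens


# Made-up activations: feature 1 is the salient one.
acts_before = [[0.0, 2.0, 0.1], [0.0, 4.0, 0.3]]
acts_after = [[0.0, 0.0, 0.1], [0.0, 0.0, 0.3]]
penalty_before = unlearn_penalty(acts_before, salient=[1])
penalty_after = unlearn_penalty(acts_after, salient=[1])
print(penalty_before, penalty_after)
```

Because the penalty is computed only from the flagged features, the success metrics (WMDP accuracy, capability tests) remain external to it, which is the point the rationale above makes about non-circularity.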
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations... $\mathcal{L}_{\text{unlearn}} = \mathbb{E}_{t \sim D_{\text{target}}}\left[\mathbb{E}_{f_i \sim F_{\text{salient}}}\left[a_i^{(t)} + \lambda c_t\right]\right]$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We propose an automated pipeline for identifying SAE features salient for a target concept via contrastive activation analysis.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Interpretability Can Be Actionable
  Interpretability research should be judged by actionability, the degree to which its insights support concrete decisions and interventions, rather than explanatory power alone.
- Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate
  Two-stage fine-tuning distills multi-agent debate into single LLMs, matching performance at 93% lower token cost while revealing agent-specific activation subspaces for steering.