CAP: Controllable Alignment Prompting for Unlearning in LLMs

Guangchun Luo; Hongli Pu; Jie Ou; Jingwen Pu; Jinyu Guo; Meng Yang; Wenhong Tian; Wenyi Li; Xunlei Chen; Zhaokun Wang

arxiv: 2604.21251 · v5 · pith:3BT6KYVMnew · submitted 2026-04-23 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-19 17:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: knowledge unlearning, large language models, prompt optimization, reinforcement learning, alignment, controllable forgetting, reversible unlearning


The pith

Reinforcement learning trains prompts that suppress specific knowledge in fixed LLMs while preserving general capabilities and allowing reversal by prompt removal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that unlearning in large language models can be performed by optimizing prompts through reinforcement learning instead of altering model weights. A prompt generator collaborates with the unchanged LLM to suppress targeted information while keeping broader abilities intact. If this holds, unlearning would become feasible for closed-source models without weight access, lower computational demands, and permit easy reversal simply by dropping the prompt. The approach seeks to fix problems of high cost, poor control over what gets forgotten, and reliance on internal model access that limit earlier techniques.

Core claim

CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively, establishing a dynamic alignment mechanism that overcomes transferability limitations of prior methods and enables reversible knowledge restoration through prompt revocation.

What carries the argument

A reinforcement-learned prompt generator that collaborates with a fixed LLM to produce alignment prompts suppressing targeted knowledge.
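
The collaboration described above can be illustrated with a deliberately tiny sketch. It is not the paper's B-PPO procedure: the "generator" here is only a softmax policy over three hand-written candidate prefixes, the frozen LLM is a stub function, and the update is a REINFORCE-style policy gradient with an expected-reward baseline. Every name (`CANDIDATES`, `frozen_llm`, `reward`, `train`) is hypothetical and chosen for illustration.

```python
import math
import random

# Hypothetical candidate prefixes; a real generator would emit free-form text.
CANDIDATES = [
    "",                                         # no prompt at all
    "Refuse every question.",                   # broad impairment
    "Decline questions about Project X only.",  # selective suppression
]


def frozen_llm(prefix: str, question: str) -> str:
    """Stand-in for a fixed LLM: its weights never change, only the prefix does."""
    if prefix == "Refuse every question.":
        return "I cannot answer."
    if prefix.startswith("Decline questions about Project X") and "Project X" in question:
        return "I cannot answer."
    return "ANSWER"


def reward(prefix: str) -> float:
    """+1 for withholding the forget-set answer, +1 for still answering a retain-set question."""
    forgot = frozen_llm(prefix, "What is Project X?") != "ANSWER"
    kept = frozen_llm(prefix, "What is 2 + 2?") == "ANSWER"
    return float(forgot) + float(kept)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 1000, lr: float = 0.5, seed: int = 0):
    rng = random.Random(seed)
    rewards = [reward(c) for c in CANDIDATES]
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        # Baseline: expected reward under the current policy (exact, since the action set is tiny).
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[i] - baseline
        # Gradient of log pi(i) with respect to logit j is 1[i == j] - probs[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
    return softmax(logits)


probs = train()
best = CANDIDATES[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

In this toy setting only the selective prefix earns both reward terms, so the policy mass moves toward it; whether a learned generator behaves the same way on a real model is exactly what the paper's experiments have to establish.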

If this is right

Unlearning becomes accessible for closed-source models where parameters cannot be inspected or changed.
The unlearning effect reverses immediately when the prompt is removed.
Forgetting boundaries gain finer control compared with methods that edit model weights directly.
General model performance outside the suppressed domain stays largely intact.
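
The reversibility point follows directly from the model being untouched: if the only state is the prefix, dropping the prefix restores the original behaviour. A minimal sketch, assuming a hypothetical `PromptGate` wrapper and a stub model (neither appears in the paper):

```python
from dataclasses import dataclass
from typing import Callable, Optional

SUPPRESS = "[SUPPRESS geography]"  # hypothetical learned prefix


def stub_model(text: str) -> str:
    """Stand-in for a frozen LLM; it has no mutable state."""
    if text.startswith(SUPPRESS) and "capital of France" in text:
        return "I don't know."
    if "capital of France" in text:
        return "Paris"
    return "4"


@dataclass
class PromptGate:
    model: Callable[[str], str]    # frozen model: text in, text out
    prefix: Optional[str] = None   # active unlearning prompt, if any

    def apply(self, prefix: str) -> None:
        self.prefix = prefix

    def revoke(self) -> None:
        self.prefix = None

    def ask(self, question: str) -> str:
        text = question if self.prefix is None else f"{self.prefix}\n{question}"
        return self.model(text)


gate = PromptGate(stub_model)
before = gate.ask("What is the capital of France?")
gate.apply(SUPPRESS)
during = gate.ask("What is the capital of France?")
unrelated = gate.ask("What is 2 + 2?")
gate.revoke()
after = gate.ask("What is the capital of France?")
print(before, during, unrelated, after)
```

The design choice worth noting is that `revoke` changes nothing about the model object, which is why restoration is immediate in this sketch.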

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployed models could receive temporary behavior adjustments through prompt swaps without any retraining step.
The same prompt-generation process might adapt to related tasks such as reducing unwanted biases or enforcing style constraints.
Cross-model tests could show whether one trained prompt generator transfers effectively to different LLM architectures.

Load-bearing premise

Reinforcement learning can train a prompt generator to collaborate with a fixed LLM such that target knowledge is suppressed while general capabilities remain selectively preserved and the effect is reversible upon prompt revocation.

What would settle it

An experiment in which the optimized prompt is applied to the model yet it continues to produce outputs showing retention of the target knowledge or suffers measurable drops in unrelated capabilities.
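
One way to make that test concrete is a small check over per-question correctness flags. This is a sketch under stated assumptions: the thresholds `max_forget_acc` and `max_retain_drop` are hypothetical, and the paper may use different metrics.

```python
from statistics import mean


def accuracy(correct_flags) -> float:
    """Fraction of questions answered correctly."""
    return mean(1.0 if c else 0.0 for c in correct_flags)


def falsifying_evidence(forget_with, retain_without, retain_with,
                        max_forget_acc=0.10, max_retain_drop=0.02):
    """Return the reasons, if any, why a run counts against the claim.

    forget_with:    correctness on target questions with the prompt applied
    retain_without: correctness on unrelated questions without the prompt
    retain_with:    correctness on unrelated questions with the prompt applied
    """
    findings = []
    if accuracy(forget_with) > max_forget_acc:
        findings.append("target knowledge retained")
    if accuracy(retain_without) - accuracy(retain_with) > max_retain_drop:
        findings.append("unrelated capability dropped")
    return findings


clean_run = falsifying_evidence([False] * 10, [True] * 10, [True] * 10)
bad_run = falsifying_evidence([True] * 5 + [False] * 5, [True] * 10, [True] * 8 + [False] * 2)
print(clean_run, bad_run)
```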

Figures

Figures reproduced from arXiv: 2604.21251 by Guangchun Luo, Hongli Pu, Jie Ou, Jingwen Pu, Jinyu Guo, Meng Yang, Wenhong Tian, Wenyi Li, Xunlei Chen, Zhaokun Wang.

**Figure 2.** The CAP pipeline consists of two stages: Prompt Generator Optimization and Inference Stage. Dual…

**Figure 3.** Visualization Example of B-PPO. During inference, the SLM generates multiple candidate prompts for the input query. The SelfCheck instruction then selects or slightly refines the most appropriate candidate to guide the final output. More implementation details of the SelfCheck instruction are provided in Appendix H.1.3.

**Figure 4.** Comparison of the attention matrix before and…

**Figure 5.** Comparison of the forgetting prompt guidance…

**Figure 6.** ROUGE-L recall comparison of unlearning methods with and without adversarial prompts.

5.2 Visualization of Hidden State Shift. Although CAP effectively reduces accuracy on sensitive questions, a critical question remains: Does it disrupt semantic understanding or redirect semantics toward an ignorance region? To investigate, we extracted hidden states from each layer of LLaMA2-7B when processing sensitiv…

**Figure 7.** Comparison of the same sentence with or without our prompt.

…that the generated prefix functions as a semantic anchor that redirects internal activations from knowledge regions toward safe/refusal regions, rather than merely introducing noise. This representation-level separation explains how CAP achieves deep unlearning while preserving linguistic fluency.

6 Conclusion. We present CAP, an end-to-end promp…
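
For readers unfamiliar with the metric named in the Figure 6 caption, ROUGE-L recall is the length of the longest common subsequence between reference and candidate, divided by the reference length. The sketch below uses lowercased whitespace tokens, which is an assumption; the paper's tokenization is not stated in this excerpt.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """LCS length divided by the number of reference tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)


reference = "the capital of france is paris"
leaky = rouge_l_recall(reference, "paris is the capital of france")
refusal = rouge_l_recall(reference, "i cannot answer that")
print(leaky, refusal)
```

A lower recall against the forget-set reference answer indicates less of the target content is being reproduced, which is why the metric suits an unlearning comparison.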

Abstract

Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CAP frames unlearning as RL-driven prompt optimization on a frozen LLM, which is a practical angle for closed-source models, but the selectivity and reversibility rest on an unshown reward design that could easily collapse into broad degradation.


The main point is that this paper tries to solve unlearning for models you cannot modify by training a separate prompt generator with reinforcement learning. The generator produces prompts that steer the fixed LLM to suppress target facts; the claim is that general performance stays mostly intact and the effect disappears when the prompt is taken away. That framing is new compared to weight-editing methods or purely heuristic prompt hacks, and it directly targets the closed-source case where access is limited. The abstract positions it as creating a reversible dynamic alignment that sidesteps transferability issues in prior work.

If the experiments hold up, this could be a useful tool for compliance scenarios where you need to forget specific data without retraining everything. The paper does a reasonable job laying out the limitations of existing approaches and why a prompt-based route matters for deployment.

What is less convincing is the core mechanism. The stress-test concern lands: without details on how the reward function balances a suppression signal against capability preservation, the RL step can converge to prompts that simply make the model less capable overall rather than selectively blocking access. The abstract gives no description of the state representation, the exact reward terms, or ablations that isolate whether selectivity is real. Reversibility on prompt removal is asserted but not obviously demonstrated against the possibility that the generator learns a general dampening effect. If the full paper includes those controls and shows quantitative separation between target forgetting and unrelated task drops, the claim strengthens; otherwise it stays assumptive.

This work is for people working on practical LLM safety and regulatory unlearning, especially teams that cannot touch weights. A reader already thinking about prompt control or RL for alignment would get value from the setup even if they end up skeptical of the results. It is coherent enough on its own terms to deserve a serious referee rather than a desk reject, though any review should focus on the reward construction and the selectivity metrics.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Controllable Alignment Prompting (CAP) framework for selective knowledge unlearning in LLMs. It decouples unlearning into an RL-based prompt optimization process in which a learnable prompt generator collaborates with a fixed LLM to suppress target knowledge while selectively preserving general capabilities; the effect is claimed to be reversible upon prompt revocation and to avoid the computational and access limitations of parameter-modifying methods.

Significance. If the central claim holds, the work would offer a practical route to controllable unlearning for closed-source models where weight access is unavailable. The introduction of an RL-trained prompt generator to realize a reversible dynamic alignment mechanism is a distinct contribution relative to existing prompt-engineering or fine-tuning approaches, provided the selectivity and reversibility can be rigorously demonstrated.

major comments (2)

[§3.2] (Reward Function): The reward function that trains the prompt generator is not given in explicit form (no equation or pseudocode). Without the precise weighting between the target-suppression term and the capability-preservation terms, it is impossible to verify that the RL objective produces selective rather than broad impairment, which is load-bearing for the abstract claim of 'precise, controllable unlearning' and 'selectively preserved' general capabilities.
[§4.3] (Reversibility Experiments): The reported restoration of capabilities upon prompt revocation lacks controls for prompt length, token distribution, or other surface-level confounders. If the observed reversibility is partly an artifact of prompt removal rather than true knowledge restoration, the 'dynamic alignment mechanism' claim is weakened; an ablation isolating these factors is required.
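
Why the weighting matters can be seen in a toy calculation. The linear form below is an assumption made for illustration, not the paper's reward; `PromptScores`, `w_forget`, and `w_retain` are hypothetical names, and the scores are invented numbers chosen to contrast a "refuse everything" prompt with a selective one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptScores:
    suppression: float  # in [0, 1]; 1 means target knowledge is fully withheld
    retention: float    # in [0, 1]; 1 means unrelated tasks are fully preserved


def reward(scores: PromptScores, w_forget: float, w_retain: float) -> float:
    """Hypothetical weighted sum of a suppression term and a retention term."""
    return w_forget * scores.suppression + w_retain * scores.retention


broad = PromptScores(suppression=1.0, retention=0.2)       # refuses almost everything
selective = PromptScores(suppression=0.9, retention=0.95)  # blocks only the target

for w_retain in (0.0, 0.1, 1.0):
    winner = "selective" if reward(selective, 1.0, w_retain) > reward(broad, 1.0, w_retain) else "broad"
    print(f"w_retain={w_retain}: {winner} prompt scores higher")
```

With these illustrative numbers the selective prompt wins only once `w_retain / w_forget` exceeds roughly 0.13 (0.1 / 0.75), so an objective with a weak retention term would reward broad impairment, which is the failure mode the comment asks the authors to rule out.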

minor comments (2)

[Abstract] The abstract states 'extensive experiments' but provides no quantitative headline numbers; adding one or two key metrics (e.g., unlearning success rate and capability retention delta) would improve readability.
[§3] Notation for the prompt generator and state representation is introduced without a compact table summarizing symbols; this slows reading of the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to enhance the clarity and rigor of the manuscript.


Referee: [§3.2] (Reward Function): The reward function that trains the prompt generator is not given in explicit form (no equation or pseudocode). Without the precise weighting between the target-suppression term and the capability-preservation terms, it is impossible to verify that the RL objective produces selective rather than broad impairment, which is load-bearing for the abstract claim of 'precise, controllable unlearning' and 'selectively preserved' general capabilities.

Authors: We agree that an explicit formulation is necessary to substantiate the selectivity claims. The manuscript describes the reward components in prose within §3.2 (negative log-likelihood penalty on target knowledge combined with retention terms on general capabilities), but does not present the combined objective as a single equation or include pseudocode. In the revised version we will add the full reward equation with explicit weighting coefficients and a short pseudocode block for the RL training loop.

Revision: yes

Referee: [§4.3] (Reversibility Experiments): The reported restoration of capabilities upon prompt revocation lacks controls for prompt length, token distribution, or other surface-level confounders. If the observed reversibility is partly an artifact of prompt removal rather than true knowledge restoration, the 'dynamic alignment mechanism' claim is weakened; an ablation isolating these factors is required.

Authors: We acknowledge that additional controls would strengthen the reversibility argument. The current §4.3 results show capability recovery after prompt removal, yet we did not explicitly ablate against length-matched or distribution-matched control prompts. We will add such an ablation in the revision, comparing revocation of the learned prompt against random and length-matched prompts to isolate the contribution of the optimized alignment.

Revision: yes
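
The control prompts mentioned in this exchange could be built along the following lines. This is a sketch only: it counts whitespace-separated words as a proxy for model tokens, and the function names, vocabulary, and example prompt are hypothetical.

```python
import random


def length_matched_control(prompt: str, vocab, rng: random.Random) -> str:
    """Random words drawn from `vocab`, with the same word count as `prompt`."""
    n = len(prompt.split())
    return " ".join(rng.choice(vocab) for _ in range(n))


def distribution_matched_control(prompt: str, rng: random.Random) -> str:
    """Same multiset of words as `prompt`, in shuffled order."""
    tokens = prompt.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


rng = random.Random(0)
learned = "Decline politely when asked about Project X details"
vocab = ["apple", "river", "seven", "quiet", "window", "orbit"]

random_ctrl = length_matched_control(learned, vocab, rng)
shuffled_ctrl = distribution_matched_control(learned, rng)
print(random_ctrl)
print(shuffled_ctrl)
```

If the learned prompt suppresses the target while both controls do not, surface features such as length or word frequency are less likely to explain the effect.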

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper proposes the CAP framework as a new end-to-end prompt-driven unlearning paradigm that uses reinforcement learning to optimize prompts for a fixed LLM. The abstract and description introduce this as a methodological decoupling of unlearning into learnable prompt optimization, with claims supported by experimental demonstration rather than any mathematical derivation. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations are present in the provided text. The central claims rest on the proposed architecture and empirical results, which do not reduce to inputs by construction, making the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unverified effectiveness of RL-driven prompt collaboration for selective suppression; this is treated as a domain assumption rather than a derived result.

axioms (1)

domain assumption: Reinforcement learning can optimize prompts to achieve selective knowledge suppression in a fixed LLM while preserving general capabilities.
Invoked when the paper states that the prompt generator collaborates with the LLM to suppress target knowledge.

invented entities (1)

dynamic alignment mechanism (no independent evidence)
purpose: To overcome transferability limitations of prior unlearning methods.
Introduced in the abstract as the outcome of the CAP framework.

pith-pipeline@v0.9.0 · 5722 in / 1041 out tokens · 60639 ms · 2026-05-19T17:09:46.710788+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: $\min_{\theta} \mathcal{L}_{\mathrm{IB}} = I(a^f_i; a^k \mid q^k) - \beta\, I(a^r_i; a^k \mid q^k)$

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

