CAP: Controllable Alignment Prompting for Unlearning in LLMs
Pith reviewed 2026-05-19 17:09 UTC · model grok-4.3
The pith
Reinforcement learning trains prompts that suppress specific knowledge in fixed LLMs while preserving general capabilities and allowing reversal by prompt removal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAP recasts unlearning as a learnable prompt-optimization process driven by reinforcement learning: a prompt generator collaborates with the LLM to suppress target knowledge while selectively preserving general capabilities. The paper claims this yields a dynamic alignment mechanism that overcomes the transferability limitations of prior methods and allows knowledge to be restored by revoking the prompt.
What carries the argument
A reinforcement-learned prompt generator that collaborates with a fixed LLM to produce alignment prompts suppressing targeted knowledge.
If this is right
- Unlearning becomes accessible for closed-source models where parameters cannot be inspected or changed.
- The unlearning effect reverses immediately when the prompt is removed.
- Forgetting boundaries gain finer control compared with methods that edit model weights directly.
- General model performance outside the suppressed domain stays largely intact.
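The reversibility point rests on the unlearning effect living entirely in the prompt rather than in the weights. A minimal Python sketch of that idea follows, assuming a black-box text-in/text-out model. `PromptGate`, `toy_llm`, and the `[SUPPRESS:...]` marker are hypothetical names invented for illustration; they are not part of CAP or of any library.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for a frozen, black-box LLM: text in, text out.
LLM = Callable[[str], str]


@dataclass
class PromptGate:
    """Wraps a fixed model; the unlearning effect lives only in the prefix."""
    llm: LLM
    alignment_prompt: Optional[str] = None  # None means the prompt is revoked

    def apply(self, prompt: str) -> None:
        self.alignment_prompt = prompt

    def revoke(self) -> None:
        self.alignment_prompt = None

    def ask(self, question: str) -> str:
        if self.alignment_prompt is None:
            return self.llm(question)
        return self.llm(f"{self.alignment_prompt}\n\n{question}")


def toy_llm(text: str) -> str:
    # Toy behaviour (assumption): refuse when the marker prefix is present
    # and the question touches the suppressed topic.
    if "[SUPPRESS:capital-of-france]" in text and "France" in text:
        return "I can't help with that."
    if "capital of France" in text:
        return "Paris"
    return "4" if "2 + 2" in text else "unknown"
```

Because the model object is never modified, calling `revoke()` restores the original behaviour by construction; whether a real LLM behaves this cleanly is exactly what the paper's experiments would need to show.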
Where Pith is reading between the lines
- Deployed models could receive temporary behavior adjustments through prompt swaps without any retraining step.
- The same prompt-generation process might adapt to related tasks such as reducing unwanted biases or enforcing style constraints.
- Cross-model tests could show whether one trained prompt generator transfers effectively to different LLM architectures.
Load-bearing premise
Reinforcement learning can train a prompt generator to collaborate with a fixed LLM such that target knowledge is suppressed while general capabilities remain selectively preserved and the effect is reversible upon prompt revocation.
What would settle it
An experiment in which the model, with the optimized prompt applied, either still produces outputs that reveal the target knowledge or shows measurable drops in unrelated capabilities.
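The two failure conditions described here can be written as a small check. The sketch below is illustrative only: substring matching is a crude proxy for "retains the knowledge", and the function names and the thresholds `max_leak` and `max_drop` are assumptions, not values from the paper.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, reference answer)
AnswerFn = Callable[[str], str]


def hit_rate(answer_fn: AnswerFn, items: List[QA]) -> float:
    """Fraction of questions whose output contains the reference answer."""
    if not items:
        return 0.0
    hits = sum(ref.lower() in answer_fn(q).lower() for q, ref in items)
    return hits / len(items)


def claim_fails(
    base: AnswerFn,
    prompted: AnswerFn,
    forget_set: List[QA],
    retain_set: List[QA],
    max_leak: float = 0.1,   # hypothetical tolerance for leaked target answers
    max_drop: float = 0.05,  # hypothetical tolerance for lost general accuracy
) -> bool:
    """True if the prompted model leaks target knowledge or loses capability."""
    leak = hit_rate(prompted, forget_set)
    drop = hit_rate(base, retain_set) - hit_rate(prompted, retain_set)
    return leak > max_leak or drop > max_drop
```

Here `base` is the model without the alignment prompt and `prompted` is the same model with it; a `True` result on a well-constructed forget set and retain set would count against the core claim.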
Original abstract
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the Controllable Alignment Prompting (CAP) framework for selective knowledge unlearning in LLMs. It decouples unlearning into an RL-based prompt optimization process in which a learnable prompt generator collaborates with a fixed LLM to suppress target knowledge while selectively preserving general capabilities; the effect is claimed to be reversible upon prompt revocation and to avoid the computational and access limitations of parameter-modifying methods.
Significance. If the central claim holds, the work would offer a practical route to controllable unlearning for closed-source models where weight access is unavailable. The introduction of an RL-trained prompt generator to realize a reversible dynamic alignment mechanism is a distinct contribution relative to existing prompt-engineering or fine-tuning approaches, provided the selectivity and reversibility can be rigorously demonstrated.
major comments (2)
- [§3.2] §3.2 (Reward Function): The reward function that trains the prompt generator is not given in explicit form (no equation or pseudocode). Without the precise weighting between the target-suppression term and the capability-preservation terms, it is impossible to verify that the RL objective produces selective rather than broad impairment, which is load-bearing for the abstract claim of 'precise, controllable unlearning' and 'selectively preserved' general capabilities.
- [§4.3] §4.3 (Reversibility Experiments): The reported restoration of capabilities upon prompt revocation lacks controls for prompt length, token distribution, or other surface-level confounders. If the observed reversibility is partly an artifact of prompt removal rather than true knowledge restoration, the 'dynamic alignment mechanism' claim is weakened; an ablation isolating these factors is required.
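The ablation requested in the second major comment needs control prompts that match the learned prompt on surface features. A minimal sketch of two such controls is given below. It assumes whitespace tokenization for simplicity (a real study would use the target model's tokenizer), and both function names are hypothetical.

```python
import random
from typing import List


def shuffled_control(prompt: str, seed: int = 0) -> str:
    """Same tokens, random order: matches length and token distribution."""
    tokens = prompt.split()
    rng = random.Random(seed)
    rng.shuffle(tokens)
    return " ".join(tokens)


def length_matched_control(prompt: str, vocab: List[str], seed: int = 0) -> str:
    """Random tokens from `vocab`, same token count: matches length only."""
    rng = random.Random(seed)
    n_tokens = len(prompt.split())
    return " ".join(rng.choice(vocab) for _ in range(n_tokens))
```

Comparing capability recovery after revoking the learned prompt against recovery after revoking each control would help separate the effect of the optimized content from the effect of simply adding and removing a prefix of that length.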
minor comments (2)
- [Abstract] The abstract states 'extensive experiments' but provides no quantitative headline numbers; adding one or two key metrics (e.g., unlearning success rate and capability retention delta) would improve readability.
- [§3] Notation for the prompt generator and state representation is introduced without a compact table summarizing symbols; this slows reading of the method section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to enhance the clarity and rigor of the manuscript.
- Referee: [§3.2] §3.2 (Reward Function): The reward function that trains the prompt generator is not given in explicit form (no equation or pseudocode). Without the precise weighting between the target-suppression term and the capability-preservation terms, it is impossible to verify that the RL objective produces selective rather than broad impairment, which is load-bearing for the abstract claim of 'precise, controllable unlearning' and 'selectively preserved' general capabilities.
  Authors: We agree that an explicit formulation is necessary to substantiate the selectivity claims. The manuscript describes the reward components in prose within §3.2 (negative log-likelihood penalty on target knowledge combined with retention terms on general capabilities), but does not present the combined objective as a single equation or include pseudocode. In the revised version we will add the full reward equation with explicit weighting coefficients and a short pseudocode block for the RL training loop.
  Revision: yes
- Referee: [§4.3] §4.3 (Reversibility Experiments): The reported restoration of capabilities upon prompt revocation lacks controls for prompt length, token distribution, or other surface-level confounders. If the observed reversibility is partly an artifact of prompt removal rather than true knowledge restoration, the 'dynamic alignment mechanism' claim is weakened; an ablation isolating these factors is required.
  Authors: We acknowledge that additional controls would strengthen the reversibility argument. The current §4.3 results show capability recovery after prompt removal, yet we did not explicitly ablate against length-matched or distribution-matched control prompts. We will add such an ablation in the revision, comparing revocation of the learned prompt against random and length-matched prompts to isolate the contribution of the optimized alignment.
  Revision: yes
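The first response describes the reward only in prose: a negative log-likelihood term on target knowledge combined with retention terms on general capabilities. The sketch below shows one way such a weighted scalar reward could be assembled. It is not the authors' equation, which the manuscript does not give; the function name, the weights `w_forget` and `w_retain`, and the clipping value `nll_cap` are all assumptions made for illustration.

```python
from statistics import fmean
from typing import Sequence


def sketch_reward(
    forget_nll: Sequence[float],           # NLL of target answers with the prompt
    retain_nll_prompted: Sequence[float],  # NLL on general tasks with the prompt
    retain_nll_base: Sequence[float],      # NLL on general tasks without it
    w_forget: float = 1.0,                 # hypothetical weighting coefficient
    w_retain: float = 1.0,                 # hypothetical weighting coefficient
    nll_cap: float = 10.0,                 # hypothetical cap on the forget term
) -> float:
    """Reward high NLL on target answers; penalize NLL increases elsewhere."""
    # Capping keeps the forget term from dominating once suppression succeeds.
    suppression = min(fmean(forget_nll), nll_cap)
    # Only degradations count; improvements on retained tasks are not rewarded.
    degradation = fmean(
        [max(0.0, p - b) for p, b in zip(retain_nll_prompted, retain_nll_base)]
    )
    return w_forget * suppression - w_retain * degradation
```

Under this form, the ratio `w_forget / w_retain` is what decides whether the optimizer prefers selective suppression or broad impairment, which is why the referee asks for the coefficients to be stated explicitly.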
Circularity Check
No significant circularity in the derivation chain
The paper proposes the CAP framework as a new end-to-end prompt-driven unlearning paradigm that uses reinforcement learning to optimize prompts for a fixed LLM. The abstract and description introduce this as a methodological decoupling of unlearning into learnable prompt optimization, with claims supported by experimental demonstration rather than any mathematical derivation. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations are present in the provided text. The central claims rest on the proposed architecture and empirical results, which do not reduce to inputs by construction, making the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reinforcement learning can optimize prompts to achieve selective knowledge suppression in a fixed LLM while preserving general capabilities.
invented entities (1)
- dynamic alignment mechanism: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: min_θ L_IB = I(a^f_i ; a^k | q^k) - β I(a^r_i ; a^k | q^k)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.