ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules
Pith reviewed 2026-06-29 17:15 UTC · model grok-4.3
The pith
In-context refusal rules let language models unlearn specific data sequentially without parameter changes or interference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ICCU induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Experiments show it suppresses target knowledge while preserving utility and handles paraphrased and cross-lingual queries.
What carries the argument
Pattern-induced refusal rules applied in-context at inference time to block recall of forgotten knowledge.
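The two application modes named in the core claim, a filter and a system prompt, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `RefusalRule` class, the regex-based matcher, the refusal string, the `echo_model` stand-in, and the fictional name "Jane Roe" are all assumptions introduced here, since the page does not describe the paper's rule format.

```python
import re
from dataclasses import dataclass

REFUSAL = "I don't know."  # assumed refusal text, for illustration only


@dataclass(frozen=True)
class RefusalRule:
    """Hypothetical rule: a readable description plus a regex trigger."""
    description: str
    pattern: str

    def matches(self, query: str) -> bool:
        return re.search(self.pattern, query, flags=re.IGNORECASE) is not None


def answer_with_filter(query, rules, model):
    """Filter mode: refuse before the model is called if any rule matches."""
    if any(rule.matches(query) for rule in rules):
        return REFUSAL
    return model(query)


def build_system_prompt(rules):
    """System-prompt mode: render the rules as instructions for the model."""
    lines = ["Refuse any question that matches one of these rules:"]
    # Sorting makes the rendered prompt independent of set iteration order.
    lines += [f"- {r.description}" for r in sorted(rules, key=lambda r: r.description)]
    return "\n".join(lines)


def echo_model(query):
    """Stand-in for a real LLM call (assumption)."""
    return f"ANSWER({query})"


rules = {RefusalRule("Questions about the fictional author Jane Roe", r"\bjane roe\b")}
blocked = answer_with_filter("Where was Jane Roe born?", rules, echo_model)
passed = answer_with_filter("What is the capital of France?", rules, echo_model)
prompt = build_system_prompt(rules)
```

In filter mode the model never sees a blocked query. In system-prompt mode the refusal decision is left to the model, so coverage depends on how the model interprets the rendered rules.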
If this is right
- Target knowledge is suppressed while model utility on other tasks is preserved.
- The approach scales to multiple sequential unlearning requests without cross-interference or cumulative utility loss.
- Robustness holds for paraphrased queries and queries in different languages.
- Original unlearning data can be discarded after rule induction, reducing storage needs.
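The order-independence behind the sequential-scaling claim follows from set union being commutative, associative, and idempotent. A small sketch, assuming rules can be represented as hashable strings (the rule texts and the `accumulate` helper are hypothetical):

```python
from itertools import permutations


def accumulate(rule_sets):
    """Fold per-request rule sets into one set using plain set union."""
    combined = frozenset()
    for rules in rule_sets:
        combined = combined | frozenset(rules)
    return combined


# Three hypothetical unlearning requests; request_b repeats a rule from request_a.
request_a = {"refuse: questions about author X"}
request_b = {"refuse: questions about book Y", "refuse: questions about author X"}
request_c = {"refuse: questions about event Z"}

# Accumulate under every arrival order and collect the distinct outcomes.
results = {accumulate(order) for order in permutations([request_a, request_b, request_c])}
order_independent = len(results) == 1
combined = next(iter(results))
```

Only the accumulated rule set is kept between requests, which is why the forget-set data itself is not needed afterwards. The sketch shows that the set is the same for every arrival order; it does not show that a model prompted with those rules behaves identically however they are rendered.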
Where Pith is reading between the lines
- Models could comply with privacy regulations more efficiently in ongoing deployments.
- The method might apply to other types of knowledge suppression beyond data unlearning.
- It raises the possibility of hybrid systems combining in-context rules with occasional fine-tuning for harder cases.
Load-bearing premise
Readable refusal rules induced from the unlearning dataset can cover all possible ways the model might express or recall the forgotten knowledge, including paraphrases and cross-lingual queries.
What would settle it
A query that uses a novel paraphrase or untranslated phrasing of the target knowledge, is not blocked by the induced rules, and still elicits the correct answer from the model.
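Such a counterexample could be searched for with a small probe harness. The sketch below is hypothetical: the `find_leaks` helper, the regex rule, the probe set, and the `toy_model` stand-in (which deliberately still "knows" the forgotten fact) are all invented for illustration.

```python
import re


def find_leaks(probes, rule_patterns, model):
    """Return queries that no rule blocks and that the model still answers correctly."""
    leaks = []
    for query, gold in probes:
        blocked = any(re.search(p, query, re.IGNORECASE) for p in rule_patterns)
        if not blocked and model(query).strip().lower() == gold.strip().lower():
            leaks.append(query)
    return leaks


def toy_model(query):
    """Stand-in model that retains the target fact (assumption)."""
    return "Springfield" if "roe" in query.lower() else "unknown"


rule_patterns = [r"\bjane roe\b"]  # hypothetical induced rule
probes = [
    ("Where was Jane Roe born?", "Springfield"),                    # matched by the rule
    ("In which town was the novelist J. Roe born?", "Springfield"),  # paraphrase, not matched
    ("What is the capital of France?", "Paris"),                    # unrelated control
]
leaks = find_leaks(probes, rule_patterns, toy_model)
```

A non-empty `leaks` list is the kind of evidence that would count against the coverage premise. An empty list only shows that the tested probes were covered.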
Original abstract
Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning requests often arrive sequentially, which challenges existing fine-tuning-based methods: fine-tuning each request is costly, accumulates utility loss, and may cause cross-request interference. To address these issues, we propose ICCU (In-Context Continual Unlearning), an in-context continual unlearning framework that induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Extensive experiments show that ICCU effectively suppresses target knowledge while preserving utility, scales across sequential requests, and remains robust to paraphrased and cross-lingual queries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ICCU, an in-context continual unlearning framework for LLMs that extracts readable refusal rules from sequential unlearning (forget) datasets via pattern induction and applies the accumulated rule set at inference time, either as a filter or via the system prompt, without any parameter updates. The approach is presented as compositional and order-independent, so the original forget-set data can be discarded after induction. Based on extensive experiments, the authors claim that it suppresses target knowledge, preserves utility on unrelated tasks, scales to multiple sequential requests without interference, and remains robust to paraphrased and cross-lingual queries.
Significance. If the empirical results hold under the coverage assumption, ICCU would provide a lightweight, non-destructive alternative to fine-tuning-based unlearning methods that avoids cumulative utility loss and cross-request interference in sequential settings. The parameter-free, inference-time nature and explicit data-discard property are practical strengths for deployed models; the readability of induced rules could also aid interpretability if coverage proves reliable.
major comments (2)
- [Method (rule induction procedure)] The claim of robustness to paraphrased and cross-lingual queries (abstract and experimental sections) depends on the induced refusal rules covering all possible expressions of target knowledge. The induction step is described as empirical pattern extraction with no stated completeness argument, exhaustive enumeration, or formal coverage guarantee; this is load-bearing because any uncovered formulation would allow leakage under the in-context-only mechanism.
- [Experiments] Experiments demonstrate suppression and utility preservation, but without the full experimental details (baselines, exact metrics, post-hoc rule selection criteria, and failure-case analysis), it is unclear whether the reported robustness generalizes beyond the tested paraphrases or if rule sets were tuned to the evaluation distribution.
minor comments (2)
- [Method] Notation for the accumulated rule set (union operation) and its application as filter vs. prompt should be formalized with pseudocode or equations for reproducibility.
- [Method] Clarify how rule induction handles conflicting or overlapping patterns across sequential requests to support the order-independence claim.
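One way the notation requested in these minor comments could look is sketched below. All symbols are assumptions introduced here, not taken from the paper: $D_i$ is the $i$-th forget set, $\mathrm{Induce}$ the rule-induction step, $\mathrm{match}$ a rule-matching predicate, $\mathrm{render}$ the conversion of a rule set into prompt text, and $f_\theta$ the frozen model.

```latex
% Assumed notation for illustration; requires amsmath for the cases environment.
\[
  \mathcal{R}_0 = \emptyset, \qquad
  \mathcal{R}_t = \mathcal{R}_{t-1} \cup \mathrm{Induce}(D_t)
               = \bigcup_{i=1}^{t} \mathrm{Induce}(D_i).
\]
\[
  \text{Filter:}\quad
  y(q) =
  \begin{cases}
    \texttt{refuse} & \text{if } \exists\, r \in \mathcal{R}_t : \mathrm{match}(r, q) = 1,\\
    f_\theta(q)     & \text{otherwise,}
  \end{cases}
  \qquad
  \text{Prompt:}\quad
  y(q) = f_\theta\bigl(\mathrm{render}(\mathcal{R}_t) \,\Vert\, q\bigr).
\]
```

Because union is commutative and associative, $\mathcal{R}_t$ is the same for any arrival order of $D_1, \dots, D_t$, and identical rules collapse because union is idempotent. This covers syntactic duplicates only; how semantically overlapping or conflicting rules are handled would still need to be specified by the authors.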
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, with clarifications on the empirical nature of our method and commitments to expand experimental details where appropriate.
point-by-point responses
- Referee: [Method (rule induction procedure)] The claim of robustness to paraphrased and cross-lingual queries (abstract and experimental sections) depends on the induced refusal rules covering all possible expressions of target knowledge. The induction step is described as empirical pattern extraction with no stated completeness argument, exhaustive enumeration, or formal coverage guarantee; this is load-bearing because any uncovered formulation would allow leakage under the in-context-only mechanism.
  Authors: We acknowledge that the rule induction is an empirical pattern extraction process without a formal completeness argument, exhaustive enumeration, or coverage guarantee. The manuscript presents ICCU as a practical, data-driven approach that induces readable rules from forget datasets and validates generalization through experiments on paraphrased and cross-lingual queries. We do not claim theoretical universality, as that would be infeasible for natural language. We will revise the method section to explicitly note the empirical basis and reliance on experimental evidence rather than formal guarantees.
  Revision: partial
- Referee: [Experiments] Experiments demonstrate suppression and utility preservation, but without the full experimental details (baselines, exact metrics, post-hoc rule selection criteria, and failure-case analysis), it is unclear whether the reported robustness generalizes beyond the tested paraphrases or if rule sets were tuned to the evaluation distribution.
  Authors: The full manuscript describes the experimental setup, baselines, and metrics. To improve clarity, we will expand the experiments section in revision to provide precise details on post-hoc rule selection criteria, additional failure-case analysis, and confirmation that rules are induced solely from forget data without reference to evaluation queries. This will better demonstrate that reported robustness is not due to distribution-specific tuning.
  Revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper proposes an empirical framework (induce refusal rules from forget-set data, apply as filter or system prompt at inference) with no mathematical derivation chain, equations, or fitted parameters that reduce to inputs by construction. Claims rest on experimental results rather than self-referential definitions or self-citation load-bearing premises. The coverage of rules for paraphrases is presented as an empirical assumption, not a derived property that collapses to the input data. This is a standard non-circular empirical method paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: It is possible to induce a set of readable refusal rules from the forget-set data that effectively cover the target knowledge even under paraphrasing and cross-lingual queries.
[4]
Determine whether the question matches a rule in the RULE SET
-
[5]
Output E if the question clearly matches a rule
-
[6]
I don’t know
Otherwise, answer the question normally with A, B, C, or D. Output ONLY one letter: A, B, C, D, or E. Do not output anything else. We score logits at all five letters. If the argmax is E, the rule is considered triggered, and the final prediction is sampleduniformly at randomfrom {A,B,C,D} . This drives the expected accuracy on triggered questions toward ...
2023
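The scoring rule in the excerpt (score five letters, treat an argmax of E as a triggered rule, then sample uniformly from the four answer letters) can be sketched as below. The `predict` function, the dictionary-of-logits input format, and the example logit values are assumptions made for illustration; the excerpt does not say how logits are obtained from the model.

```python
import random

LETTERS = ["A", "B", "C", "D", "E"]


def predict(letter_logits, rng):
    """Return (prediction, triggered) for one multiple-choice question.

    letter_logits maps each of A-E to its logit (assumed input format).
    """
    best = max(LETTERS, key=lambda letter: letter_logits[letter])
    if best == "E":
        # Rule triggered: replace the refusal with a uniform draw from A-D.
        return rng.choice(LETTERS[:4]), True
    return best, False


rng = random.Random(0)  # seeded for reproducibility
normal_pred, normal_triggered = predict(
    {"A": 0.1, "B": 2.3, "C": -1.0, "D": 0.4, "E": 1.9}, rng
)
refused_pred, refused_triggered = predict(
    {"A": 0.1, "B": 0.3, "C": -1.0, "D": 0.4, "E": 1.9}, rng
)
```

Passing the random generator in explicitly keeps the random fallback reproducible across evaluation runs.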