arXiv: 2605.14309 · v1 · submitted 2026-05-14 · cs.CV · cs.AI · cs.LG


ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu


Pith reviewed 2026-05-15 02:32 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG

keywords machine unlearning · vision-language models · concept decomposition · semantic concepts · interpretable representations · model editing · multimodal learning


The pith

Vision-language models can unlearn specific concepts by decomposing images into sparse semantic combinations and suppressing only the targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that shifts machine unlearning in VLMs from the image level to the concept level. It first builds a compact vocabulary of relevant semantic concepts from the forgetting data using a multimodal LLM. Visual representations are then broken down into sparse, nonnegative combinations drawn from this vocabulary, creating an explicit handle for editing. Unlearning is cast as an optimization step that reduces the weight of target concepts while leaving non-target semantics inside the same image and the model's broader cross-modal knowledge untouched. Experiments in both in-domain and out-of-domain settings show this yields more complete forgetting of the chosen concepts and stronger retention of everything else compared with prior instance-level approaches.

Core claim

By constructing a task-specific concept vocabulary and expressing visual features as sparse nonnegative linear combinations of those concepts, unlearning can be performed as a direct optimization over the concept coefficients that selectively suppresses only the target entries, thereby removing the desired knowledge without erasing unrelated semantics that coexist in the same image or global model capabilities.

What carries the argument

Interpretable concept decomposition of visual representations into sparse nonnegative linear combinations drawn from a compact task-specific semantic vocabulary.
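A minimal sketch of what such a decomposition and edit could look like on toy data. It is not the paper's algorithm: the solver (projected proximal gradient), the function names `decompose`, `suppress`, and `reconstruct`, the concept names, and the hand-written 4-dimensional embeddings are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decompose(feature, concepts, l1=0.01, iters=500):
    """Nonnegative sparse coding: feature ~= sum_k a[k] * concepts[k], a[k] >= 0.

    Minimizes 0.5 * ||feature - sum_k a[k] * concepts[k]||^2 + l1 * sum_k a[k]
    by projected proximal-gradient steps (an assumed solver choice).
    """
    gram = [[dot(ci, cj) for cj in concepts] for ci in concepts]
    corr = [dot(c, feature) for c in concepts]
    # The largest absolute row sum bounds the top eigenvalue of the Gram
    # matrix, so its inverse is a safe step size.
    step = 1.0 / max(sum(abs(g) for g in row) for row in gram)
    a = [0.0] * len(concepts)
    for _ in range(iters):
        grad = [dot(gram[k], a) - corr[k] for k in range(len(concepts))]
        # Gradient step, L1 shrinkage, and projection onto a >= 0 together.
        a = [max(0.0, a[k] - step * (grad[k] + l1)) for k in range(len(concepts))]
    return a


def suppress(coeffs, targets, scale=0.0):
    """Scale down the coefficients of target concepts; leave the rest as-is."""
    return [scale * w if k in targets else w for k, w in enumerate(coeffs)]


def reconstruct(coeffs, concepts):
    """Map concept coefficients back to feature space."""
    dim = len(concepts[0])
    return [sum(coeffs[k] * concepts[k][d] for k in range(len(concepts)))
            for d in range(dim)]


# Hypothetical vocabulary with hand-written, orthonormal concept embeddings.
vocab = ["dog", "grass", "sky"]
concepts = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0]]
feature = [0.8, 0.5, 0.0, 0.0]          # an image showing a dog on grass

coeffs = decompose(feature, concepts)   # approximately [0.79, 0.49, 0.0]
edited = suppress(coeffs, targets={vocab.index("dog")})
edited_feature = reconstruct(edited, concepts)
```

In this toy case the "dog" coefficient is zeroed while the "grass" coefficient is unchanged, which is the selective behavior the summary describes. Real concept embeddings are correlated rather than orthonormal, so suppressing one coefficient would not isolate one concept this cleanly.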

If this is right

Target concepts are suppressed more thoroughly than with image-level unlearning.
Non-target semantics that appear in the same images remain largely intact.
Overall model performance on unrelated tasks stays competitive with existing methods.
The same procedure applies to both in-domain and out-of-domain forgetting requests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition step could be reused to inspect or edit other learned behaviors beyond unlearning.
Concept-level control may simplify compliance with privacy rules that require removal of specific sensitive attributes.
The approach invites experiments on whether the learned vocabulary transfers across different VLMs or datasets.

Load-bearing premise

Visual representations can be decomposed into sparse nonnegative combinations of semantic concepts from a compact task-specific vocabulary.
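One generic way to write this premise down is the standard nonnegative sparse-coding objective below. The abstract does not state the paper's exact objective, so the symbols here are assumptions: $v \in \mathbb{R}^{d}$ is a visual feature, $C \in \mathbb{R}^{d \times K}$ stacks the $K$ concept embeddings $c_k$ as columns, $a$ holds the coefficients, $\lambda$ is a sparsity weight, and $\mathcal{T}$ is the index set of target concepts.

```latex
\[
\hat{a} \;=\; \arg\min_{a \in \mathbb{R}^{K},\; a \ge 0}\;
  \tfrac{1}{2}\,\lVert v - C a \rVert_2^2 \;+\; \lambda\,\lVert a \rVert_1,
\qquad
v \;\approx\; C\hat{a} \;=\; \sum_{k=1}^{K} \hat{a}_k\, c_k .
\]
\[
\tilde{a}_k \;=\;
\begin{cases}
0, & k \in \mathcal{T},\\
\hat{a}_k, & k \notin \mathcal{T},
\end{cases}
\qquad
\tilde{v} \;=\; C\tilde{a}.
\]
```

Stacking $N$ features as columns gives $V \approx C A$ with $A \in \mathbb{R}^{K \times N}$, so each row of $A$ corresponds to one concept. The premise holds only to the extent that the residual $v - C\hat{a}$ is small and carries no target information.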

What would settle it

After running the unlearning procedure, the model still produces accurate descriptions or classifications involving the target concept when shown images that contain both target and non-target elements.
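A rough sketch of how that test could be scripted for a single mixed image, assuming access to image features before and after unlearning and to text embeddings of the concept names. The helper names, the toy vectors, and the `drop` and `keep` thresholds are hypothetical and not taken from the paper; the ratio test also assumes the "before" similarities are positive.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def concept_scores(image_feature, text_embeddings):
    """Cosine similarity between one image feature and each concept embedding."""
    return {name: cosine(image_feature, emb)
            for name, emb in text_embeddings.items()}


def unlearning_holds(before, after, target, non_targets, drop=0.5, keep=0.9):
    """True if the target score fell enough and non-target scores were retained.

    drop and keep are illustrative thresholds relative to the 'before' scores.
    """
    target_forgotten = after[target] <= drop * before[target]
    context_kept = all(after[n] >= keep * before[n] for n in non_targets)
    return target_forgotten and context_kept


# Toy embeddings for an image containing both a target and a non-target concept.
text_embeddings = {"dog": [1.0, 0.0, 0.0], "grass": [0.0, 1.0, 0.0]}
feature_before = [0.8, 0.5, 0.0]
feature_after = [0.05, 0.5, 0.0]   # hypothetical feature after unlearning

before = concept_scores(feature_before, text_embeddings)
after = concept_scores(feature_after, text_embeddings)
passed = unlearning_holds(before, after, "dog", ["grass"])
unchanged = unlearning_holds(before, before, "dog", ["grass"])
```

With these toy vectors `passed` is `True` and `unchanged` is `False`: a model whose features do not move fails the check, which is the outcome that would count against the claim.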

Figures

Figures reproduced from arXiv: 2605.14309 by Jing Lin, Junhao Dong, Li Xu, Piotr Koniusz, Shen Lin.

**Figure 1.** Motivation. The left part shows target concepts to be forgotten, while the right part shows remaining contextual concepts that should be preserved. ICED more effectively suppresses forgetting concepts and shifts the model’s focus toward non-target contextual concepts, indicating more selective and utility-preserving unlearning.

**Figure 2.** An overview of our proposed ICED method.

**Figure 3.** Retrieval visualization for in-domain forgetting on ImageNet-1K. The target subgroup …

**Figure 4.** Visualization of the top-5 concepts obtained by ICED.

**Figure 5.** Ablation study of the vocabulary size and sparsity regularization weight on CIFAR-10.

**Figure 6.** Ablation study of the balancing hyperparameters on CIFAR-10.

**Figure 7.** Retrieval visualization for out-of-domain forgetting on CIFAR-10. After forgetting the …

Original abstract

Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper offers a concept decomposition route to finer-grained unlearning in VLMs, but the abstract provides zero evidence that the decomposition is accurate enough to support the preservation guarantees.


The main takeaway is that this work moves VLM unlearning from whole-image forgetting to concept-level control. The authors extract a compact task-specific vocabulary from the forgetting set with a multimodal LLM, then decompose visual representations into sparse nonnegative combinations of those concepts. The decomposition is positioned as an explicit handle for suppressing only the target concepts while leaving intra-image non-target content and cross-modal knowledge intact.

Referee Report

1 major / 1 minor

Summary. The paper proposes ICED, a concept-level unlearning framework for Vision-Language Models. It uses a multimodal LLM to build a compact task-specific concept vocabulary from the forgetting set, decomposes visual representations into sparse nonnegative combinations of these concepts to create an explicit manipulation interface, and casts unlearning as concept-level optimization that suppresses target concepts while preserving intra-instance non-target semantics and global cross-modal knowledge. Experiments in both in-domain and out-of-domain forgetting settings are claimed to achieve more comprehensive target forgetting, superior preservation of non-target knowledge within images, and competitive model utility relative to prior VLM unlearning methods.

Significance. If the decomposition is shown to be faithful, the framework would provide a more granular and interpretable alternative to instance-level unlearning, directly addressing concept entanglement in images. This could meaningfully advance privacy and safety applications for VLMs by enabling selective knowledge removal without broad collateral effects.

major comments (1)

[Abstract] The central claim that sparse nonnegative decomposition supplies an 'explicit interface' for selective suppression while exactly preserving non-target semantics rests on unverified decomposition fidelity. No reconstruction error (e.g., ||V - C A||), completeness, or orthogonality metric is reported, so it is impossible to confirm that suppressing rows of A leaves the residual encoding of non-target content intact.

minor comments (1)

The abstract would be strengthened by including at least one key quantitative result (e.g., forgetting accuracy or utility delta) alongside the qualitative claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We agree that additional quantitative verification of the decomposition fidelity would strengthen the central claims and will incorporate the suggested metrics in the revised version.


Referee: [Abstract] The central claim that sparse nonnegative decomposition supplies an 'explicit interface' for selective suppression while exactly preserving non-target semantics rests on unverified decomposition fidelity. No reconstruction error (e.g., ||V - C A||), completeness, or orthogonality metric is reported, so it is impossible to confirm that suppressing rows of A leaves the residual encoding of non-target content intact.

Authors: We acknowledge the validity of this point: the abstract does not report direct fidelity metrics such as reconstruction error ||V - C A||, completeness, or orthogonality, which would provide stronger evidence that suppressing target rows in A preserves non-target semantics. Although the full manuscript validates the approach via downstream unlearning results (superior non-target preservation in both in-domain and out-of-domain settings), these indirect demonstrations do not substitute for explicit decomposition quality measures. In the revised manuscript we will add ||V - C A||_F reconstruction error, concept completeness scores, and sparsity/orthogonality statistics in the experiments section to directly confirm fidelity. This revision will make the 'explicit interface' claim verifiable without changing the core method or results.

Revision: yes
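The fidelity statistics discussed in this exchange are simple to compute once V, C, and A are available. The sketch below assumes V is d×N (one feature per column), C is d×K (one concept per column), and A is K×N (one concept per row), with matrices stored as lists of rows. The function names and toy values are illustrative, and "concept completeness" is omitted because the text does not define it.

```python
import math


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def frobenius(X):
    return math.sqrt(sum(x * x for row in X for x in row))


def relative_reconstruction_error(V, C, A):
    """||V - C A||_F / ||V||_F; 0 means the concepts explain V exactly."""
    R = matmul(C, A)
    diff = [[v - r for v, r in zip(v_row, r_row)] for v_row, r_row in zip(V, R)]
    return frobenius(diff) / frobenius(V)


def sparsity(A, tol=1e-8):
    """Fraction of coefficients that are (numerically) zero."""
    flat = [x for row in A for x in row]
    return sum(abs(x) <= tol for x in flat) / len(flat)


def max_concept_overlap(C):
    """Largest absolute cosine between two distinct concept columns of C."""
    cols = [list(c) for c in zip(*C)]

    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        return num / (frobenius([u]) * frobenius([v]))

    return max(abs(cos(cols[i], cols[j]))
               for i in range(len(cols)) for j in range(i + 1, len(cols)))


# Toy data: d = 3, K = 2 concepts, N = 2 images.
C = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
A = [[0.8, 0.0],
     [0.5, 1.0]]
V = [[0.8, 0.0],
     [0.5, 1.0],
     [0.0, 0.1]]   # the last row is content the concepts cannot express

err = relative_reconstruction_error(V, C, A)
zero_fraction = sparsity(A)
overlap = max_concept_overlap(C)
```

A nonzero `err` flags feature content that lives outside the concept span and is therefore invisible to coefficient-level edits, while a large `overlap` would mean that suppressing one row of A also perturbs what correlated concepts encode.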

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper presents a methodological framework that constructs a task-specific concept vocabulary via an external multimodal LLM and decomposes visual representations into sparse nonnegative combinations of semantic concepts to enable selective suppression. No equations, self-referential derivations, or fitted parameters are exhibited that reduce the unlearning outcome or preservation guarantees to inputs by construction. The approach relies on external concept extraction and experimental validation rather than tautological self-definition, fitted-input predictions, or load-bearing self-citations. The central claims about explicit interfaces and selective forgetting remain independent of the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5495 in / 1060 out tokens · 33054 ms · 2026-05-15T02:32:10.764681+00:00 · methodology

