Recognition: no theorem link
ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition
Pith reviewed 2026-05-15 02:32 UTC · model grok-4.3
The pith
Vision-language models can unlearn specific concepts by decomposing images into sparse semantic combinations and suppressing only the targets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing a task-specific concept vocabulary and expressing visual features as sparse nonnegative linear combinations of those concepts, unlearning can be performed as a direct optimization over the concept coefficients that selectively suppresses only the target entries, thereby removing the desired knowledge without erasing unrelated semantics that coexist in the same image or global model capabilities.
What carries the argument
Interpretable concept decomposition of visual representations into sparse nonnegative linear combinations drawn from a compact task-specific semantic vocabulary.
If this is right
- Target concepts are suppressed more thoroughly than with image-level unlearning.
- Non-target semantics that appear in the same images remain largely intact.
- Overall model performance on unrelated tasks stays competitive with existing methods.
- The same procedure applies to both in-domain and out-of-domain forgetting requests.
Where Pith is reading between the lines
- The same decomposition step could be reused to inspect or edit other learned behaviors beyond unlearning.
- Concept-level control may simplify compliance with privacy rules that require removal of specific sensitive attributes.
- The approach invites experiments on whether the learned vocabulary transfers across different VLMs or datasets.
Load-bearing premise
Visual representations can be decomposed into sparse nonnegative combinations of semantic concepts from a compact task-specific vocabulary.
What would settle it
After running the unlearning procedure, the model still produces accurate descriptions or classifications involving the target concept when shown images that contain both target and non-target elements.
Figures
read the original abstract
Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ICED, a concept-level unlearning framework for Vision-Language Models. It uses a multimodal LLM to build a compact task-specific concept vocabulary from the forgetting set, decomposes visual representations into sparse nonnegative combinations of these concepts to create an explicit manipulation interface, and casts unlearning as concept-level optimization that suppresses target concepts while preserving intra-instance non-target semantics and global cross-modal knowledge. Experiments in both in-domain and out-of-domain forgetting settings are claimed to achieve more comprehensive target forgetting, superior preservation of non-target knowledge within images, and competitive model utility relative to prior VLM unlearning methods.
Significance. If the decomposition is shown to be faithful, the framework would provide a more granular and interpretable alternative to instance-level unlearning, directly addressing concept entanglement in images. This could meaningfully advance privacy and safety applications for VLMs by enabling selective knowledge removal without broad collateral effects.
major comments (1)
- [Abstract] Abstract: The central claim that sparse nonnegative decomposition supplies an 'explicit interface' for selective suppression while exactly preserving non-target semantics rests on unverified decomposition fidelity. No reconstruction error (e.g., ||V - C A||), completeness, or orthogonality metric is reported, so it is impossible to confirm that suppressing rows of A leaves the residual encoding non-target content intact.
minor comments (1)
- The abstract would be strengthened by including at least one key quantitative result (e.g., forgetting accuracy or utility delta) alongside the qualitative claims.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the manuscript. We agree that additional quantitative verification of the decomposition fidelity would strengthen the central claims and will incorporate the suggested metrics in the revised version.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that sparse nonnegative decomposition supplies an 'explicit interface' for selective suppression while exactly preserving non-target semantics rests on unverified decomposition fidelity. No reconstruction error (e.g., ||V - C A||), completeness, or orthogonality metric is reported, so it is impossible to confirm that suppressing rows of A leaves the residual encoding non-target content intact.
Authors: We acknowledge the validity of this point: the abstract does not report direct fidelity metrics such as reconstruction error ||V - C A||, completeness, or orthogonality, which would provide stronger evidence that suppressing target rows in A preserves non-target semantics. Although the full manuscript validates the approach via downstream unlearning results (superior non-target preservation in both in-domain and out-of-domain settings), these indirect demonstrations do not substitute for explicit decomposition quality measures. In the revised manuscript we will add ||V - C A||_F reconstruction error, concept completeness scores, and sparsity/orthogonality statistics in the experiments section to directly confirm fidelity. This revision will make the 'explicit interface' claim verifiable without changing the core method or results. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents a methodological framework that constructs a task-specific concept vocabulary via an external multimodal LLM and decomposes visual representations into sparse nonnegative combinations of semantic concepts to enable selective suppression. No equations, self-referential derivations, or fitted parameters are exhibited that reduce the unlearning outcome or preservation guarantees to inputs by construction. The approach relies on external concept extraction and experimental validation rather than tautological self-definition, fitted-input predictions, or load-bearing self-citations. The central claims about explicit interfaces and selective forgetting remain independent of the method's own outputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Learning transferable visual models from natural language supervi- sion,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervi- sion,” inInternational Conference on Machine Learning, 2021, pp. 8748–8763
work page 2021
-
[2]
Trustworthy ai: From principles to practices,
B. Li, P. Qi, B. Liu, S. Di, J. Liu, J. Pei, J. Yi, and B. Zhou, “Trustworthy ai: From principles to practices,”ACM Computing Surveys, vol. 55, no. 9, pp. 1–46, 2023
work page 2023
-
[3]
Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,
J. Dong, R. Z. Moayedi, Y .-S. Ong, and S.-M. Moosavi-Dezfooli, “Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
work page 2026
-
[4]
J. Dong, X. Qu, C. Zhang, S. Q. Rong, N. D. Thai, W. Pan, X. Li, T. Liu, P. Koniusz, and Y .-S. Ong, “Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,” inThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[5]
Can bad teaching induce forgetting? unlearning in deep networks using an incompetent teacher,
V . S. Chundawat, A. K. Tarun, M. Mandal, and M. Kankanhalli, “Can bad teaching induce forgetting? unlearning in deep networks using an incompetent teacher,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 7210–7217
work page 2023
-
[6]
Erm-ktp: Knowledge-level machine unlearning via knowledge transfer,
S. Lin, X. Zhang, C. Chen, X. Chen, and W. Susilo, “Erm-ktp: Knowledge-level machine unlearning via knowledge transfer,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 147–20 155
work page 2023
-
[7]
Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary,
M. Chen, W. Gao, G. Liu, K. Peng, and C. Wang, “Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7766–7775
work page 2023
-
[8]
Gdr-gma: Machine unlearning via direction- rectified and magnitude-adjusted gradients,
S. Lin, X. Zhang, W. Susilo, X. Chen, and J. Liu, “Gdr-gma: Machine unlearning via direction- rectified and magnitude-adjusted gradients,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 9087–9095
work page 2024
-
[9]
Learning to unlearn while retaining: Combating gradient conflicts in machine unlearning,
G. Patel and Q. Qiu, “Learning to unlearn while retaining: Combating gradient conflicts in machine unlearning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 4211–4221
work page 2025
-
[10]
Safe-clip: Removing nsfw concepts from vision-and-language models,
S. Poppi, T. Poppi, F. Cocchi, M. Cornia, L. Baraldi, and R. Cucchiara, “Safe-clip: Removing nsfw concepts from vision-and-language models,” inEuropean Conference on Computer Vision, 2024, pp. 340–356
work page 2024
-
[11]
Multidelete for multimodal machine unlearning,
J. Cheng and H. Amiri, “Multidelete for multimodal machine unlearning,” inEuropean Confer- ence on Computer Vision, 2024, pp. 165–184
work page 2024
-
[12]
Targeted unlearning with single layer unlearning gradient,
Z. Cai, Y . Tan, and M. S. Asif, “Targeted unlearning with single layer unlearning gradient,” in International Conference on Machine Learning, 2025, pp. 6257–6290
work page 2025
-
[13]
Cliperase: Efficient unlearning of visual-textual associations in clip,
T. Yang, L. Dai, X. Wang, M. Cheng, Y . Tian, and X. Zhang, “Cliperase: Efficient unlearning of visual-textual associations in clip,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025, pp. 30 438–30 452
work page 2025
-
[14]
Targeted forgetting of image subgroups in clip models,
Z. Zhang, G. Liu, C. Fleming, R. R. Kompella, and C. Xu, “Targeted forgetting of image subgroups in clip models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 9870–9880
work page 2025
-
[15]
Machine unlearning via task simplex arithmetic,
J. Dong, H. Zhu, Y . Zhang, X. Qu, Y .-S. Ong, and P. Koniusz, “Machine unlearning via task simplex arithmetic,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[16]
Text-to-concept (and back) via cross-model alignment,
M. Moayeri, K. Rezaei, M. Sanjabi, and S. Feizi, “Text-to-concept (and back) via cross-model alignment,” inProceedings of the 40th International Conference on Machine Learning, 2023, pp. 25 037–25 060
work page 2023
-
[17]
Post-hoc concept bottleneck models,
M. Yuksekgonul, M. Wang, and J. Zou, “Post-hoc concept bottleneck models,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=nA5AZ8CEyow 10
work page 2023
-
[18]
Do vision-language pretrained models learn composable primitive concepts?
T. Yun, U. Bhalla, E. Pavlick, and C. Sun, “Do vision-language pretrained models learn composable primitive concepts?”Transactions on Machine Learning Research, 2023. [Online]. Available: https://openreview.net/forum?id=YwNrPLjHSL
work page 2023
-
[19]
Stair: Learning sparse text and image representation in grounded tokens,
C. Chen, B. Zhang, L. Cao, J. Shen, T. Gunter, A. Jose, A. Toshev, Y . Zheng, J. Shlens, R. Pang et al., “Stair: Learning sparse text and image representation in grounded tokens,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 15 079–15 094
work page 2023
-
[20]
Interpreting CLIP’s image representation via text-based decomposition,
Y . Gandelsman, A. A. Efros, and J. Steinhardt, “Interpreting CLIP’s image representation via text-based decomposition,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=5Ca9sSzuDp
work page 2024
-
[21]
A. Chattopadhyay, R. Pilgrim, and R. Vidal, “Information maximization perspective of or- thogonal matching pursuit with applications to explainable ai,” inProceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 2956–2990
work page 2023
-
[22]
Interpreting clip with sparse linear concept embeddings (splice),
U. Bhalla, A. Oesterling, S. Srinivas, F. P. Calmon, and H. Lakkaraju, “Interpreting clip with sparse linear concept embeddings (splice),” inProceedings of the 38th International Conference on Neural Information Processing Systems, 2024, pp. 84 298–84 328
work page 2024
-
[23]
Robust superalignment: Weak-to- strong robustness generalization for vision-language models,
J. Dong, C. Zhang, X. Qu, Z. Ma, P. Koniusz, and Y . S. Ong, “Robust superalignment: Weak-to- strong robustness generalization for vision-language models,”Advances in Neural Information Processing Systems, vol. 38, pp. 18 345–18 377, 2025
work page 2025
-
[24]
Zero-shot class unlearning in clip with synthetic samples,
A. Kravets and V . P. Namboodiri, “Zero-shot class unlearning in clip with synthetic samples,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, 2025, pp. 6456–6464
work page 2025
-
[25]
Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,
J. Dong, P. Koniusz, X. Qu, and Y .-S. Ong, “Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2025, pp. 236–247
work page 2025
-
[26]
BREEDS: benchmarks for subpopulation shift,
S. Santurkar, D. Tsipras, and A. Madry, “BREEDS: benchmarks for subpopulation shift,” in9th International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=mQPBmvyAuk
work page 2021
-
[27]
Imagenet: A large-scale hierarchical image database,
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255
work page 2009
-
[28]
Learning multiple layers of features from tiny images,
A. Krizhevsky, “Learning multiple layers of features from tiny images,”Master’s thesis, Univer- sity of Tront, 2009
work page 2009
-
[29]
Machine unlearning of features and labels,
A. Warnecke, L. Pirch, C. Wressnegger, and K. Rieck, “Machine unlearning of features and labels,” inProceedings 2023 Network and Distributed System Security Symposium, 2023
work page 2023
-
[30]
Unrolling sgd: Understanding factors influencing machine unlearning,
A. Thudi, G. Deza, V . Chandrasekaran, and N. Papernot, “Unrolling sgd: Understanding factors influencing machine unlearning,” in2022 IEEE 7th European Symposium on Security and Privacy, 2022, pp. 303–319
work page 2022
-
[31]
Eternal sunshine of the spotless net: Selective forgetting in deep networks,
A. Golatkar, A. Achille, and S. Soatto, “Eternal sunshine of the spotless net: Selective forgetting in deep networks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9304–9312
work page 2020
-
[32]
An information theoretic approach to machine unlearning,
J. Foster, K. Fogarty, S. Schoepf, Z. Dugue, C. Öztireli, and A. Brintrup, “An information theoretic approach to machine unlearning,” 2024. [Online]. Available: https://arxiv.org/abs/2402.01401
-
[33]
V . S. Chundawat, A. K. Tarun, M. Mandal, and M. Kankanhalli, “Zero-shot machine unlearning,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 2345–2354, 2023
work page 2023
-
[34]
Food-101–mining discriminative components with random forests,
L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” inEuropean Conference on Computer Vision, 2014, pp. 446–461
work page 2014
-
[35]
An analysis of single-layer networks in unsupervised feature learning,
A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” inProceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223
work page 2011
-
[36]
A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and B. Katz, “Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models,” inProceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 9453–9463. 11 A Additional Descriptions of ICED Algorithm 1 s...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.