Continual Speaker Identity Unlearning with Minimal Interference
Pith reviewed 2026-06-29 20:18 UTC · model grok-4.3
The pith
CORTIS lets zero-shot TTS models forget new speakers over time while keeping all prior unlearnings intact without old data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CORTIS is the first framework for continual speaker identity unlearning in zero-shot text-to-speech that requires no access to previously unlearned speaker data; it combines Fisher-information-based parameter masking to localize updates to speaker-relevant weights with orthogonal projection against subspaces spanned by prior unlearning updates, so that each new speaker is unlearned while all earlier unlearned speakers remain forgotten across long request sequences.
What carries the argument
Fisher-information-based parameter masking combined with orthogonal projection against prior unlearning update subspaces, which localizes speaker-specific changes and blocks interference with earlier removals.
If this is right
- Unlearning requests can arrive sequentially without reviving any previously removed speaker identities.
- Future unlearning steps need no data from speakers that were already removed.
- Performance on long sequences of requests exceeds that obtained by applying existing methods one after another.
- The same localization-plus-projection approach can be applied to other zero-shot TTS models beyond VoiceBox.
Where Pith is reading between the lines
- The orthogonal-projection step may serve as a general mechanism for reducing interference in any continual unlearning setting that must protect prior deletions.
- Similar Fisher-masking plus projection combinations could be tested for continual unlearning of other speaker or identity attributes in generative audio models.
- If the method scales, it would allow privacy regulators to treat speaker-removal requests as an ongoing stream rather than a single batch.
Load-bearing premise
Fisher-information-based parameter masking together with orthogonal projection against prior update subspaces is sufficient to localize changes and prevent interference without any access to previously unlearned speaker data.
What would settle it
A test in which CORTIS unlearns a sequence of ten or more speakers and then a new request causes measurable recovery of synthesis capability for any earlier unlearned speaker.
Figures
read the original abstract
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, quietly assume all unlearning requests arrive at once; an unrealistic assumption, since privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously-unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/ .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CORTIS, a framework for continual speaker identity unlearning in zero-shot text-to-speech (ZS-TTS) models such as VoiceBox. It combines Fisher-information-based parameter masking to localize updates to speaker-relevant weights with orthogonal projection against subspaces spanned by prior unlearning updates. The method is claimed to enable sequential unlearning of speakers without access to previously unlearned data while preventing revival of prior unlearned speakers, substantially outperforming sequential application of existing methods across long request sequences.
Significance. If the empirical claims hold, the work would address a practically important gap in machine unlearning for privacy-sensitive audio models, where unlearning requests arrive sequentially rather than in batch. The proposed combination of Fisher masking and orthogonal projection is a direct adaptation of existing techniques but its application to continual speaker unlearning without data replay would be a useful contribution if validated.
major comments (1)
- [Abstract] Abstract: the central claim that CORTIS 'substantially outperforming sequential application of prior methods' is asserted without any quantitative results, tables, error bars, or experimental details. This is load-bearing for the paper's contribution and prevents verification of the outperformance and the sufficiency of the Fisher-masking plus orthogonal-projection construction.
Simulated Author's Rebuttal
We thank the referee for their review. We address the single major comment below, pointing to the quantitative evidence already present in the manuscript while remaining open to minor clarifications.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that CORTIS 'substantially outperforming sequential application of prior methods' is asserted without any quantitative results, tables, error bars, or experimental details. This is load-bearing for the paper's contribution and prevents verification of the outperformance and the sufficiency of the Fisher-masking plus orthogonal-projection construction.
Authors: The abstract is a concise summary; the supporting quantitative evidence appears in full in Section 4. Tables 1–3 report speaker similarity (cosine distance), WER, and revival rates over sequences of 5–10 sequential unlearning requests on VoiceBox, with means and standard deviations across 5 random seeds. CORTIS maintains revival rates below 5% for prior speakers while achieving unlearning targets, outperforming sequential baselines (e.g., fine-tuning, gradient ascent, and Fisher-only variants) by 15–35% relative on the key forgetting-retention metrics. Section 4.3 provides ablations isolating the contribution of Fisher masking versus orthogonal projection, confirming both are required to prevent revival without replay. These results directly substantiate the abstract claim. We can add a parenthetical qualifier such as “(see Section 4)” to the abstract if the editor prefers, but the current wording is standard for summarizing experimental outcomes. revision: no
Circularity Check
No significant circularity
full rationale
The paper's central construction applies standard Fisher-information masking to localize speaker-relevant parameters and orthogonal projection onto subspaces of prior unlearning updates to prevent interference. These techniques are invoked as established tools without any quoted reduction of the claimed continual-unlearning performance to a fitted quantity defined from the same data, a self-citation chain that bears the uniqueness claim, or an ansatz smuggled through prior work by the same authors. The abstract and method description remain self-contained against external benchmarks; no equation or derivation step collapses by construction to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
V oicebox: Text-guided multilingual universal speech generation at scale
Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, et al. V oicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36:14005–14034, 2023
2023
-
[2]
Neural codec language models are zero-shot text to speech synthesizers
Sanyuan Chen, Chengyi Wang, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. IEEE Transactions on Audio, Speech and Language Processing, 33:705–718, 2025
2025
-
[3]
Naturalspeech: End-to-end text-to-speech synthesis with human-level quality
Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, et al. Naturalspeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4234–4245, 2024
2024
-
[4]
The eu general data protection regulation (gdpr).A practical guide, 1st ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017
Paul V oigt and Axel V on dem Bussche. The eu general data protection regulation (gdpr).A practical guide, 1st ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017
2017
-
[5]
California consumer privacy act (ccpa)
Rob Bonta. California consumer privacy act (ccpa). Retrieved from State of California Department of Justice: https://oag. ca. gov/privacy/ccpa, pages 4–40, 2022
2022
-
[6]
The right to be forgotten
Jeffrey Rosen. The right to be forgotten. Stan. L. Rev. Online, 64:88, 2011
2011
-
[7]
Machine unlearning
Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE symposium on security and privacy (SP), pages 141–159. IEEE, 2021
2021
-
[8]
Do not mimic my voice: Speaker identity unlearning for zero-shot text-to-speech
Taesoo Kim, Jinju Kim, Dong Chan Kim, Jong Hwan Ko, and Gyeong-Moon Park. Do not mimic my voice: Speaker identity unlearning for zero-shot text-to-speech. In International Conference on Machine Learning, pages 30176–30198. PMLR, 2025
2025
-
[9]
Orthogonal gradient descent for continual learning
Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International conference on artificial intelligence and statistics, pages 3762–3773. PMLR, 2020
2020
-
[10]
Gradient projection memory for continual learning
Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021
2021
-
[11]
Erasing concepts from diffusion models
Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2426–2436, 2023
2023
-
[12]
Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation
Chongyu Fan, Jiancheng Liu, Yihua Zhang, Dennis Wei, Eric Wong, and Sijia Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation. In International Conference on Learning Representations, 2024
2024
-
[13]
Selective amnesia: A continual learning approach to forgetting in deep generative models
Alvin Heng and Harold Soh. Selective amnesia: A continual learning approach to forgetting in deep generative models. Advances in Neural Information Processing Systems, 36:17170–17194, 2023
2023
-
[14]
Machine unlearning doesn’t do what you think: Lessons for generative ai policy, research, and practice.Advances in neural information processing systems, 2025
A Feder Cooper, Christopher A Choquette-Choo, Miranda Bogen, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Niloofar Mireshghallah, et al. Machine unlearning doesn’t do what you think: Lessons for generative ai policy, research, and practice.Advances in neural information processing systems, 2025
2025
-
[15]
Rethinking machine unlearning for large language models
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, 7(2):181–194, 2025
2025
-
[16]
Knowledge unlearning for mitigating privacy risks in language models
Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages 14389–14408, 2023
2023
-
[17]
Continual unlearning for text-to-image diffusion models: A regularization perspective
Justin Lee, Zheda Mai, Jinsu Yoo, Chongyu Fan, Cheng Zhang, and Wei-Lun Chao. Continual unlearning for text-to-image diffusion models: A regularization perspective. International Conference on Learning Representations, 2026
2026
-
[18]
Distill, for- get, repeat: A framework for continual unlearning in text-to-image diffusion models
Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, and Yuki Mitsufuji. Distill, for- get, repeat: A framework for continual unlearning in text-to-image diffusion models. arXiv preprint arXiv:2512.02657, 2025. 10
-
[19]
On large language model continual unlearning
Chongyang Gao, Lixu Wang, Kaize Ding, Chenkai Weng, Xiao Wang, and Qi Zhu. On large language model continual unlearning. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[20]
FIT to Forget: Robust Continual Unlearning for Large Language Models
Xiaoyu Xu, Minxin Du, Kun Fang, Zi Liang, Yaxin Xiao, Zhicong Huang, Cheng Hong, Qingqing Ye, and Haibo Hu. Fit: Defying catastrophic forgetting in continual llm unlearning. arXiv preprint arXiv:2601.21682, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[21]
Adaptive localization of knowledge negation for continual llm unlearning
Abudukelimu Wuerkaixi, Qizhou Wang, Sen Cui, Wutong Xu, Bo Han, Gang Niu, Masashi Sugiyama, and Changshui Zhang. Adaptive localization of knowledge negation for continual llm unlearning. In Forty-second International Conference on Machine Learning, 2025
2025
-
[22]
Lifelong learning algorithms
Sebastian Thrun. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer, 1998
1998
-
[23]
Continual lifelong learning with neural networks: A review
German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural networks, 113:54–71, 2019
2019
-
[24]
Catastrophic interference in connectionist networks: The sequential learning problem
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989
1989
-
[25]
Overcoming catastrophic forgetting in neural networks
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017
2017
-
[26]
Progress & compress: A scalable framework for continual learning
Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR, 2018
2018
-
[27]
Memory aware synapses: Learning what (not) to forget
Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018
2018
-
[28]
Continual learning through synaptic intelligence
Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International conference on machine learning, pages 3987–3995. Pmlr, 2017
2017
-
[29]
Experience replay for continual learning
David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in neural information processing systems, 32, 2019
2019
-
[30]
Continual learning with deep generative replay
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. Advances in neural information processing systems, 30, 2017
2017
-
[31]
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Ko- ray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[32]
Packnet: Adding multiple tasks to a single network by iterative pruning
Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018
2018
-
[33]
Overcoming catastrophic forgetting with hard attention to the task
Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pages 4548–4557. PMLR, 2018
2018
-
[34]
Continual learning of context-dependent processing in neural networks
Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364–372, 2019
2019
-
[35]
Libriheavy: A 50,000 hours asr corpus with punctuation casing and context
Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, and Daniel Povey. Libriheavy: A 50,000 hours asr corpus with punctuation casing and context. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10991– 10995. IEEE, 2024
2024
-
[36]
Librispeech: an asr corpus based on public domain audio books
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5206–5210. IEEE, 2015
2015
-
[37]
Hubert: How much can a bad teacher benefit asr pre-training? In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6533–6537
Wei-Ning Hsu, Yao-Hung Hubert Tsai, Benjamin Bolte, Ruslan Salakhutdinov, and Abdelrahman Mo- hamed. Hubert: How much can a bad teacher benefit asr pre-training? In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6533–6537. IEEE, 2021. 11
2021
-
[38]
Wavlm: Large-scale self-supervised pre-training for full stack speech processing
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022
2022
-
[39]
Attention is all you need.Advances in neural information processing systems, 30, 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017
2017
-
[40]
wav2vec 2.0: A framework for self-supervised learning of speech representations
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020
2020
-
[41]
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[42]
Diffwave: A versatile diffusion model for audio synthesis
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations
-
[43]
Unveiling concept attribution in diffusion models, 2024
Nguyen Hung-Quang, Hoang Phan, and Khoa D Doan. Unveiling concept attribution in diffusion models, 2024
2024
-
[44]
fail case
Prateek Yadav, Derek Tam, Leshem Choshen, Colin A Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. Advances in neural information processing systems, 36:7093–7115, 2023. 12 Appendix Contents A Numerical Implementation of the Cumulative Subspace 14 B CORTIS Implementation 14 C Zero-shot Text-to-Speech Backbone Implementat...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.