REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference

Anurag Sharma; Koustav Rudra; Prasenjit Mitra; Sai Teja Chunchu; Sandipan Sikdar

arxiv: 2606.07141 · v1 · pith:Y5B5RXDKnew · submitted 2026-06-05 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 22:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords machine unlearning · clinical disease inference · multi-label classification · MIMIC-III · privacy in machine learning · benchmark for unlearning · medical AI safety


The pith

Existing machine unlearning methods show a clear trade-off between retained model utility and successful removal of specific patient data in multi-label clinical disease inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents REMEDI, a benchmark built on the MIMIC-III clinical database, to test how well machine unlearning works when models must predict multiple diseases at once from patient records. It evaluates several existing unlearning approaches across different sets of data to forget, covering both multi-label and multi-class settings while measuring both how much the model still performs well and how completely the targeted information has been removed. Results indicate that stronger unlearning tends to hurt predictive accuracy and that current techniques are especially ill-suited to the correlated labels typical in medical data. This matters because hospitals and patients may request removal of individual records for privacy reasons, yet retraining from scratch is too costly and exact unlearning remains intractable.

Core claim

REMEDI supplies a standardized evaluation suite for retention and unlearning on real clinical records, where experiments with prior unlearning algorithms reveal an inherent utility-unlearning trade-off and show that those algorithms remain largely unsuitable for multi-label classification tasks.

What carries the argument

The REMEDI benchmark, which defines diverse forget-instance sets, multi-label and multi-class tasks, and joint metrics for utility retention and unlearning success on longitudinal MIMIC-III data.
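To make the idea of joint metrics concrete, here is a minimal Python sketch. It assumes micro-averaged F1 as the utility measure and treats a low score on the forget set as a rough forgetting signal. This page does not state which metrics REMEDI actually uses, so `micro_f1`, `joint_report`, and the toy label matrices are illustrative assumptions, not the benchmark's code.

```python
from typing import Dict, Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over every (record, label) pair of a multi-label task."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def joint_report(retain_true, retain_pred, forget_true, forget_pred) -> Dict[str, float]:
    """Report utility on retained records next to performance on forgotten ones."""
    return {
        "retain_f1": micro_f1(retain_true, retain_pred),  # should stay high
        "forget_f1": micro_f1(forget_true, forget_pred),  # assumed proxy: lower suggests more forgetting
    }


# Toy data (made up): two retained records, one forgotten record, three labels.
report = joint_report(
    retain_true=[[1, 0, 1], [0, 1, 0]],
    retain_pred=[[1, 0, 1], [0, 1, 1]],
    forget_true=[[1, 1, 0]],
    forget_pred=[[0, 0, 0]],
)
print(report)
```

A low forget-set score alone does not prove removal; comparing against a model retrained without the forget set would be the stricter reference.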

If this is right

New unlearning algorithms will need to handle label correlations explicitly to succeed in medical settings.
Practical deployment of clinical inference models may require hybrid retention-unlearning pipelines rather than pure removal techniques.
Regulatory or patient-driven data-deletion requests will likely force accuracy losses unless the trade-off is resolved.
Benchmarking on real longitudinal records, rather than synthetic data, exposes limitations hidden by simpler testbeds.
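The point about label correlations can be illustrated with a small Python sketch that estimates how often one label co-occurs with another. The function name `conditional_cooccurrence` and the toy label rows are assumptions for illustration; they are not drawn from MIMIC-III or from the paper.

```python
from typing import List


def conditional_cooccurrence(label_rows: List[List[int]]) -> List[List[float]]:
    """Return C where C[i][j] estimates P(label j = 1 | label i = 1)."""
    n_labels = len(label_rows[0])
    counts = [[0] * n_labels for _ in range(n_labels)]
    for row in label_rows:
        for i in range(n_labels):
            if row[i]:
                for j in range(n_labels):
                    if row[j]:
                        counts[i][j] += 1
    result = []
    for i in range(n_labels):
        total = counts[i][i]  # number of records carrying label i
        result.append([counts[i][j] / total if total else 0.0 for j in range(n_labels)])
    return result


# Toy label matrix (made up): four records, three disease labels.
toy_labels = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
cooc = conditional_cooccurrence(toy_labels)
print(cooc[1][0])  # how often label 0 appears when label 1 is present
```

When an entry such as `cooc[1][0]` is close to 1, a model that has forgotten label 1 for a record may still recover it from label 0, which is one plausible reason single-label unlearning methods struggle here.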

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the observed trade-off holds across larger models, hospitals may need to maintain separate models for different patient cohorts rather than a single shared model.
The longitudinal structure noted in the benchmark suggests that time-aware forgetting mechanisms could become a useful research direction.
Success on REMEDI could serve as a minimum viability test before any clinical unlearning method is considered for regulatory review.

Load-bearing premise

That the chosen forget instances and metrics capture the real privacy-removal requests that would arise in clinical practice without introducing selection bias.

What would settle it

An experiment in which at least one existing unlearning method achieves both high unlearning scores and undiminished multi-label accuracy on the REMEDI test splits would falsify the reported trade-off.
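The falsification test above can be phrased as a simple check over per-method results. The sketch below is hypothetical throughout: the `MethodResult` fields, the thresholds `min_unlearning` and `max_utility_drop`, the method names, and all numbers are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MethodResult:
    name: str
    utility_before: float    # multi-label accuracy of the original model
    utility_after: float     # same metric after unlearning
    unlearning_score: float  # higher means more complete forgetting (assumed scale 0..1)


def breaks_trade_off(r: MethodResult, min_unlearning: float = 0.9,
                     max_utility_drop: float = 0.01) -> bool:
    """True if a method forgets well without a meaningful loss of utility."""
    utility_drop = r.utility_before - r.utility_after
    return r.unlearning_score >= min_unlearning and utility_drop <= max_utility_drop


def counterexamples(results: List[MethodResult]) -> List[str]:
    """Names of methods that would contradict the reported trade-off."""
    return [r.name for r in results if breaks_trade_off(r)]


# Made-up numbers: neither method satisfies both conditions at once.
observed = [
    MethodResult("method_a", 0.80, 0.60, 0.95),  # forgets well, loses utility
    MethodResult("method_b", 0.80, 0.79, 0.40),  # keeps utility, forgets little
]
print(counterexamples(observed))

# A hypothetical method that would falsify the trade-off claim.
with_winner = observed + [MethodResult("method_c", 0.80, 0.795, 0.95)]
print(counterexamples(with_winner))
```

The thresholds carry the whole judgment, so any real use would need them fixed in advance rather than tuned after seeing results.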

Figures

Figures reproduced from arXiv: 2606.07141 by Anurag Sharma, Koustav Rudra, Prasenjit Mitra, Sai Teja Chunchu, Sandipan Sikdar.

**Figure 2.** Utility evaluation across forget set sizes for distinct and large-scale unlearning scenarios. (a) BioBERT: Only SCRUB …

Original abstract

Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, and data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning patient-specific data is intractable, and retraining with minor data removal is resource-intensive. While there exists several machine unlearning methods that can be used, their utility is generally restricted to non-medical domains. Moreover, the existing benchmarks for evaluating such unlearning methods primarily utilize synthetically curated datasets, which are not truly representative of real-world systems. Hence, the effectiveness of these unlearning methods in the medical domain is largely unclear. To this end, we introduce REMEDI, an extensive benchmark for machine unlearning tailored to multi-label and multiclass clinical disease inference, where label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging. Unlike the existing benchmarks, REMEDI considers: (1) a relevant application domain (medical), (2) comprehensive unlearning setups involving diverse sets of forget instances, (3) challenging unlearning scenarios including multi-label and multi-class classification tasks, and (4) evaluation metrics involving performance both in terms of utility and extent of unlearning achieved. REMEDI is developed using the MIMIC-III clinical database that contains comprehensive clinical data of patients. Experiments with existing unlearning methods indicate that there exists a trade-off between utility and unlearning performance. They are also largely unsuited to multi-label classification tasks. To facilitate reproducibility, we make our benchmark publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

REMEDI builds a real-data benchmark for clinical unlearning but the forget-set choices need tighter justification before the trade-off claims land cleanly.


REMEDI puts together a benchmark on MIMIC-III records for testing unlearning in multi-label disease inference. The main point is that it shifts evaluation away from synthetic data toward actual longitudinal patient records with label correlations and safety angles, then runs existing methods and finds a utility-forgetting trade-off plus weak results on multi-label tasks.

The paper does the useful work of spelling out the gaps in prior benchmarks and then delivering one that includes varied forget setups, both multi-label and multi-class tasks, and metrics that look at retained performance alongside forgetting. Making the whole thing public is the practical part that lets other groups run their own tests.

The soft spot sits in how the forget instances were picked. The design claims the chosen patients and metrics match real privacy requests, but the paper does not lay out explicit selection rules from MIMIC-III. Without that, the reported trade-off and the unsuitability conclusion for multi-label work could be tied to the specific construction rather than a general property. That is the main place where extra detail would strengthen the results.

This is for people building or evaluating unlearning methods who want a medical-domain testbed, or for clinical ML groups tracking privacy requirements. It deserves a serious referee because the domain matters and a released benchmark can shape later work, even if the forget-set documentation needs tightening.

Referee Report

1 major / 0 minor

Summary. The paper introduces REMEDI, a benchmark for machine unlearning in multi-label and multiclass clinical disease inference built on the MIMIC-III database. It claims that label correlations, longitudinal structure, and safety constraints make unlearning particularly challenging in this domain; that existing unlearning methods exhibit a utility-unlearning trade-off; and that these methods are largely unsuited to multi-label tasks. The benchmark incorporates diverse forget-instance sets, utility and unlearning metrics, and is released publicly to support reproducibility.

Significance. If the forget sets and metrics prove representative of real clinical privacy requests, REMEDI would supply a needed domain-specific evaluation resource for privacy-preserving ML in medicine, where data removal requests are increasingly common. The use of real longitudinal clinical records rather than synthetic data and the public release of the benchmark are concrete strengths that would aid future method development.

major comments (1)

[Abstract (REMEDI design paragraph)] The claim that 'the chosen forget instances and metrics in REMEDI accurately capture real-world privacy requests' is load-bearing for the central finding of a utility-unlearning trade-off and unsuitability for multi-label tasks, yet the text supplies no explicit selection criteria for forget patients from MIMIC-III (random sampling, disease-specific, longitudinal span, or full-record deletion). Without such criteria or a matching argument to typical clinical removal requests, the observed trade-off could be an artifact of the benchmark construction rather than a general property.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment on the abstract's description of forget-instance selection below.


Referee: Abstract (REMEDI design paragraph): the claim that 'the chosen forget instances and metrics in REMEDI accurately capture real-world privacy requests' is load-bearing for the central finding of a utility-unlearning trade-off and unsuitability for multi-label tasks, yet the text supplies no explicit selection criteria for forget patients from MIMIC-III (random sampling, disease-specific, longitudinal span, or full-record deletion). Without such criteria or a matching argument to typical clinical removal requests, the observed trade-off could be an artifact of the benchmark construction rather than a general property.

Authors: We agree that the abstract would benefit from greater precision on this point. The full manuscript (Section 3.2) constructs the forget sets via a combination of random sampling stratified by disease prevalence and record length, plus targeted selection of patients with multi-visit longitudinal data to simulate full-record deletion requests. We will revise the abstract to briefly state these criteria and add a short paragraph in the introduction linking them to common clinical privacy scenarios (e.g., GDPR Article 17 or HIPAA patient-requested record removal). This revision will make explicit that the benchmark aims to reflect representative rather than exhaustive real-world conditions.

Revision: yes
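For readers who want to picture the sampling scheme named in the simulated rebuttal, here is a minimal Python sketch of stratified forget-set selection. It is an assumption-laden illustration, not the authors' procedure: the record fields `id`, `prevalence_bucket`, and `n_visits`, the two-way visit bucketing, and the `fraction` parameter are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_forget_set(patients: List[Dict], fraction: float, seed: int = 0) -> List[int]:
    """Sample a forget set with roughly the same fraction from every stratum.

    Strata are (prevalence_bucket, visit bucket) pairs; both are hypothetical
    stand-ins for "disease prevalence" and "record length".
    """
    rng = random.Random(seed)  # fixed seed keeps the forget set reproducible
    strata = defaultdict(list)
    for p in patients:
        visit_bucket = "multi_visit" if p["n_visits"] > 1 else "single_visit"
        strata[(p["prevalence_bucket"], visit_bucket)].append(p["id"])

    forget: List[int] = []
    for key in sorted(strata):
        ids = sorted(strata[key])
        # At least one patient per stratum, so rare strata are not skipped.
        k = min(len(ids), max(1, round(fraction * len(ids))))
        forget.extend(rng.sample(ids, k))
    return sorted(forget)


# Toy cohort (made up): eight patients spread evenly over four strata.
cohort = [
    {"id": i,
     "prevalence_bucket": "common" if i % 2 == 0 else "rare",
     "n_visits": 1 if i < 4 else 3}
    for i in range(8)
]
forget_ids = stratified_forget_set(cohort, fraction=0.5, seed=1)
print(forget_ids)
```

Fixing the seed and sorting the strata keeps the selected forget set stable across runs, which matters if different unlearning methods are to be compared on identical forget instances.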

Circularity Check

0 steps flagged

No circularity: benchmark construction with empirical reporting only


The paper introduces REMEDI as a new benchmark dataset and reports experimental results on existing unlearning methods. No derivation chain, first-principles predictions, fitted parameters renamed as predictions, or self-referential equations exist. Claims about trade-offs are direct empirical observations from running methods on the benchmark, not reductions to inputs by construction. The design choices (forget sets, metrics) are presented as deliberate but are not shown to be equivalent to the reported outcomes via any definitional loop. Self-citations, if present, are not load-bearing for any central result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MIMIC-III data and the defined unlearning scenarios represent the key real-world challenges; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: the MIMIC-III clinical database contains comprehensive clinical data of patients suitable for multi-label disease inference benchmarks.
The benchmark is developed using this database as stated.



