A Case Study on the Impact of Anonymization Along the RAG Pipeline

Andreea-Elena Bodea, Stephen Meisenbacher, Florian Matthes

Pith reviewed 2026-05-10 08:53 UTC · model grok-4.3

classification cs.CR cs.CL

keywords anonymization, along, pipeline, case, despite, impact, information, mitigation

The pith

Where anonymization is applied in a RAG pipeline, at the dataset or at the generated answer, changes the observed trade-off between privacy protection and response utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-Augmented Generation (RAG) systems retrieve relevant documents from a private dataset and pass them to a large language model to produce answers. When the dataset contains personal details, private information can be exposed to the model, to the end user, or to both. Anonymization removes names, identifiers, and other sensitive markers from text. The authors test two placements for this step: on the dataset, before it is used for retrieval, or on the generated answer, before it reaches the user. Their case study measures how each choice affects both the privacy achieved and the utility of the final output. The abstract reports that the privacy-utility trade-off differs by placement, indicating that where anonymization happens is not a neutral choice.
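The two placements described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' setup: the regex-based `anonymize` function, the word-overlap `retrieve` function, and the echoing `generate` stand-in are all hypothetical simplifications introduced here, and `model_log` exists only to make visible what the generator was shown.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately narrow PII patterns (emails and phone-like numbers).
# A real anonymizer would also need to handle names, addresses, and so on.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def anonymize(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def tokens(text: str) -> set:
    """Lowercased word tokens, used by the toy retriever."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def generate(query: str, context: List[str], model_log: List[str]) -> str:
    """Stand-in for an LLM call: records what it saw and echoes the context."""
    model_log.extend(context)
    return "Based on the records: " + " ".join(context)


def rag_dataset_anonymized(query: str, docs: List[str]) -> Tuple[str, List[str]]:
    """Placement 1: anonymize the dataset before retrieval and generation."""
    model_log: List[str] = []
    clean_docs = [anonymize(d) for d in docs]
    answer = generate(query, retrieve(query, clean_docs), model_log)
    return answer, model_log


def rag_answer_anonymized(query: str, docs: List[str]) -> Tuple[str, List[str]]:
    """Placement 2: run RAG on raw data, then anonymize the generated answer."""
    model_log: List[str] = []
    answer = generate(query, retrieve(query, docs), model_log)
    return anonymize(answer), model_log


if __name__ == "__main__":
    docs = [
        "Contact Alice at alice@example.com about the invoice.",
        "The cafeteria opens at nine.",
    ]
    query = "Who handles the invoice?"
    for pipeline in (rag_dataset_anonymized, rag_answer_anonymized):
        answer, seen = pipeline(query, docs)
        print(pipeline.__name__, "| answer:", answer, "| model saw:", seen)
```

In this toy setting both placements hide the email address from the end user, but only dataset-level anonymization keeps it from the generator. The name "Alice" survives in both cases, which shows how much the outcome depends on the anonymizer that is plugged in.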

Core claim

We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.

Load-bearing premise

The specific anonymization techniques, datasets, and RAG configurations used in the case study are representative enough for the observed placement effects to generalize.

Figures

Figures reproduced from arXiv: 2604.15958 by Andreea-Elena Bodea, Florian Matthes, Stephen Meisenbacher.

Original abstract

Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical case study with no mathematical derivations, free parameters, or postulated entities; the ledger is therefore empty.

Reference graph

Works this paper leans on

41 extracted references · 37 canonical work pages

[1]

Andreea-Elena Bodea, Stephen Meisenbacher, Alexandra Klymenko, and Florian Matthes. 2026. SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems. arXiv:2601.03979 [cs.CR] doi:10.48550/arXiv.2601.03979

work page doi:10.48550/arxiv.2601.03979 2026
[2]

Yihang Cheng, Lan Zhang, Junyang Wang, Mu Yuan, and Yunhao Yao. 2025. RemoteRAG: A Privacy-Preserving LLM Cloud RAG Service. In Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, Vienna, Austria, 3820–3837. doi:10.18653/v1/2025.findings-acl.197

work page doi:10.18653/v1/2025.findings-acl.197 2025
[3]

Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tai, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping...

work page doi:10.5555/3722577.3722647 2024
[4]

Stav Cohen, Ron Bitton, and Ben Nassi. 2024. Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking. doi:10.48550/arXiv.2409.08045 arXiv:2409.08045

work page doi:10.48550/arxiv.2409.08045 2024
[5]

Cynthia Dwork. 2006. Differential privacy. In International colloquium on automata, languages, and programming. Springer, 1–12. doi:10.1007/11787006

work page doi:10.1007/11787006 2006
[6]

Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD IWSPA ’26, June 23–25, 2026, Frankfurt am Main, Germany Andreea-Elena Bodea,...

work page doi:10.1145/3637528.3671470 2024
[7]

Xi Fang, Liang Qiao, Jun Shi, and Hong An. 2025. Guardian Angel: A Secure and Efficient Retrieval-Augmented Generation Framework. In 2025 5th International Conference on Artificial Intelligence and Industrial Technology Applications (AIITA). 1773–1777. doi:10.1109/AIITA65135.2025.11047845

work page doi:10.1109/aiita65135.2025.11047845 2025
[8]

Julian Garcia, Jiaqi Gong, Michal Zajac, and Andrew Hahn. 2025. DF-RAG: A Dual Federated Retrieval-Augmented Generation Framework for Collaborative Medical AI. In Proceedings of the ACM/IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies (Yeshiva University Museum, New York, NY, USA) (CHASE ’25). Association ...

work page doi:10.1145/3721201.3725426 2025
[9]

Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning (Pittsburgh, Pennsylvania, USA) (ICML ’06). Association for Computing Machinery, New York, NY, USA, 377–384. doi:10.1145/1143844.1143892

work page doi:10.1145/1143844.1143892 2006
[10]

Nicolas Grislain. 2025. RAG with Differential Privacy. In 2025 IEEE Conference on Artificial Intelligence (CAI). 847–852. doi:10.1109/CAI64502.2025.00150

work page doi:10.1109/cai64502.2025.00150 2025
[11]

Jiaming He, Cheng Liu, Guanyu Hou, Wenbo Jiang, and Jiachen Li. 2025. PRESS: Defending Privacy in Retrieval-Augmented Generation via Embedding Space Shifting. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1–5. doi:10.1109/ICASSP49660.2025.10887843

work page doi:10.1109/icassp49660.2025.10887843 2025
[12]

Longzhu He, Peng Tang, Yuanhe Zhang, Pengpeng Zhou, and Sen Su. 2025. Mitigating privacy risks in Retrieval-Augmented Generation via locally private entity perturbation. Information Processing & Management 62, 4 (2025), 104150. doi:10.1016/j.ipm.2025.104150

work page doi:10.1016/j.ipm.2025.104150 2025
[13]

Lijie Hu, Ivan Habernal, Lei Shen, and Di Wang. 2024. Differentially Private Natural Language Models: Recent Advances and Future Directions. In Findings of the Association for Computational Linguistics: EACL 2024. Association for Computational Linguistics, St. Julian’s, Malta, 478–499. https://aclanthology.org/2024.findings-eacl.33

2024
[14]

Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, and Danqi Chen. 2023. Privacy Implications of Retrieval-Based Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 14887–14902. doi:10.18653/v1/2023.emnlp-main.921

work page doi:10.18653/v1/2023 2023
[15]

Waqar Hussain. 2025. Mitigating Values Debt in Generative AI: Responsible Engineering with Graph RAG. In 2025 IEEE/ACM International Workshop on Responsible AI Engineering (RAIE). 9–12. doi:10.1109/RAIE66699.2025.00006

work page doi:10.1109/raie66699.2025.00006 2025
[16]

Bryan Klimt and Yiming Yang. 2004. The enron corpus: A new dataset for email classification research. In European conference on machine learning. Springer, 217–226. doi:10.1007/978-3-540-30115-8_22

work page doi:10.1007/978-3-540-30115-8_22 2004
[17]

Oleksandra Klymenko, Stephen Meisenbacher, and Florian Matthes. 2022. Differential Privacy in Natural Language Processing: The Story So Far. In Proceedings of the Fourth Workshop on Privacy in Natural Language Processing. Association for Computational Linguistics, Seattle, United States, 1–11. doi:10.18653/v1/2022.privatenlp-1.1

work page doi:10.18653/v1/2022 2022
[18]

Agrim Kulshreshtha, Aditya Choudhary, Tejas Taneja, and Seema Verma. 2025. Enhancing Healthcare Accessibility: A RAG-Based Medical Chatbot Using Transformer Models. In 2024 International Conference on IT Innovation and Knowledge Discovery (ITIKD). 1–4. doi:10.1109/ITIKD63574.2025.11005179

work page doi:10.1109/itikd63574.2025.11005179 2025
[19]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Van...

work page arXiv 2020
[20]

Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, 4138–4153. doi:10.18653/v1/2023.findings-emnlp.272

work page doi:10.18653/v1/2023.findings- 2023
[21]

Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See-Kiong Ng, Tat-Seng Chua, and Yang Deng. 2025. Knowledge Boundary of Large Language Models: A Survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, Austria, 5131–5157. doi:10....

work page doi:10.18653/v1/2025.acl-long.256 2025
[22]

Pierre Lison, Ildikó Pilán, David Sanchez, Montserrat Batet, and Lilja Øvrelid. 2021. Anonymisation Models for Text Data: State of the art, Challenges and Future Directions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4188–4203. doi:10.18653/v1/2...

work page doi:10.18653/v1/2021.acl-long.323 2021
[24]

Yiwei Liu, Duo Li, Hu Wang, Pengpeng Zhou, Yuying Xie, and Peng Yin. 2025. Woodpecker: A Locally Deployed Large Language Model for Protecting Sensitive Information via RAG and Semantic Recognition. In 2025 IEEE 19th International Conference on Big Data Science and Engineering (BigDataSE). 70–78. doi:10.1109/BigDataSE66491.2025.00018

work page arXiv 2025
[25]

Anupam Mehta and Aditya Patel. 2025. Secure Framework for Retrieval-Augmented Generation: Challenges and Solutions. IJARCCE 01 (Jan. 2025). doi:10.17148/ijarcce.2025.14114

work page doi:10.17148/ijarcce.2025.14114 2025
[26]

Stephen Meisenbacher, Maulik Chevli, and Florian Matthes. 2024. 1-Diffractor: Efficient and Utility-Preserving Text Obfuscation Leveraging Word-Level Metric Differential Privacy. In Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics (Porto, Portugal) (IWSPA ’24). Association for Computing Machinery, New York, NY, USA, 23–33....

work page doi:10.1145/3643651.3659896 2024
[27]

Stephen Meisenbacher, Maulik Chevli, and Florian Matthes. 2025. On the Impact of Noise in Differentially Private Text Rewriting. In Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, Albuquerque, New Mexico, 514–532. doi:10.18653/v1/2025.findings-naacl.32

work page doi:10.18653/v1/2025.findings-naacl.32 2025
[29]

Nidhi Mishra, Kanchan Rai, and Dhruv Sharma. 2025. SecureRag: Preventing Sensitive Information Leakage In Rag Pipelines. In 2025 Second International Conference on Pioneering Developments in Computer Science & Digital Technologies (IC2SDT). 600–605. doi:10.1109/IC2SDT68218.2025.11383622

work page doi:10.1109/ic2sdt68218.2025.11383622 2025
[30]

Ildikó Pilán, Pierre Lison, Lilja Øvrelid, Anthi Papadopoulou, David Sánchez, and Montserrat Batet. 2022. The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization. Computational Linguistics 48, 4 (Dec. 2022), 1053–1101. doi:10.1162/coli_a_00458

work page doi:10.1162/coli_a_00458 2022
[31]

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

2019
[32]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 3982–3992. doi...

work page doi:10.18653/v1/d19-1410 2019
[33]

Alireza Salemi and Hamed Zamani. 2025. Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR) (Padua, Italy) (ICTIR ’25). Association for Computing M...

work page doi:10.1145/3731120.3744595 2025
[34]

Saiteja Utpala, Sara Hooker, and Pin-Yu Chen. 2023. Locally Differentially Private Document Generation Using Zero Shot Prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, 8442–8457. doi:10.18653/v1/2023.findings-emnlp.566

work page doi:10.18653/v1/2023.findings-emnlp.566 2023
[35]

Shang Wang, Tianqing Zhu, Bo Liu, Ming Ding, Dayong Ye, Wanlei Zhou, and Philip Yu. 2025. Unique Security and Privacy Threats of Large Language Models: A Comprehensive Survey. ACM Comput. Surv. 58, 4, Article 83 (Oct. 2025), 36 pages. doi:10.1145/3764113

work page doi:10.1145/3764113 2025
[36]

Chris M. Ward and Josh Harguess. 2025. Adversarial threat vectors and risk mitigation for retrieval-augmented generation systems. In Assurance and Security for AI-enabled Systems 2025, Vol. 13476. International Society for Optics and Photonics, SPIE, 134760A. doi:10.1117/12.3055931

work page doi:10.1117/12.3055931 2025
[37]

Huanyi Ye, Jiale Guo, Ziyao Liu, and Kwok-Yan Lam. 2025. Efficient Privacy-Preserving Retrieval Augmented Generation with Distance-Preserving Encryption. In 2025 3rd International Conference on Foundation and Large Language Models (FLLM). 668–676. doi:10.1109/FLLM67465.2025.11391120

work page doi:10.1109/fllm67465.2025.11391120 2025
[38]

Shenglai Zeng, Jiankun Zhang, Pengfei He, Yiding Liu, Yue Xing, Han Xu, Jie Ren, Yi Chang, Shuaiqiang Wang, Dawei Yin, and Jiliang Tang. 2024. The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG). In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, Bangkok, Th...

work page doi:10.18653/v1/ 2024
[39]

Shenglai Zeng, Jiankun Zhang, Pengfei He, Jie Ren, Tianqi Zheng, Hanqing Lu, Han Xu, Hui Liu, Yue Xing, and Jiliang Tang. 2025. Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, ...

work page doi:10.18653/v1/2025.emnlp-main.1247 2025
[40]

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. 2026. Retrieval-augmented generation for ai-generated content: A survey. Data Science and Engineering (2026), 1–29. doi:10.1007/s41019-025-00335-5

work page doi:10.1007/s41019-025-00335-5 2026
[41]

Yujia Zhou, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Zhicheng Dou, Tsung-Yi Ho, and Philip S. Yu. 2024. Trustworthiness in Retrieval-Augmented Generation Systems: A Survey. doi:10.48550/arXiv.2409.10102 arXiv:2409.10102

work page doi:10.48550/arxiv.2409 2024