A Case Study on the Impact of Anonymization Along the RAG Pipeline
Pith reviewed 2026-05-10 08:53 UTC · model grok-4.3
The pith
Where anonymization is placed in a RAG pipeline, at the dataset or at the generated answer, produces observable differences in the trade-off between privacy protection and response utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.
Load-bearing premise
The specific anonymization techniques, datasets, and RAG configurations used in the case study are representative enough to generalize the observed placement effects.
Original abstract
Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases may create privacy concerns, where the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers in the underlying data represents a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no works consider the placement of anonymization along the RAG pipeline, i.e., asking the question, where should anonymization happen? In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization took place, demonstrating the significance of privacy risk mitigation placement in RAG.
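To make the two placement points concrete, the following is a minimal sketch of a toy RAG pipeline in which anonymization is applied either to the dataset before retrieval or to the generated answer afterwards. It is not the paper's method: the regex redactor, the word-overlap retriever, the echoing `generate` stub, and all names (`anonymize`, `run_rag`, `DOCS`, `QUERY`) are assumptions introduced purely for illustration.

```python
import re

# Hypothetical regex redactor standing in for a real anonymization technique.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and US-style phone numbers with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def tokens(text: str) -> set[str]:
    """Lowercased word set used by the toy retriever."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, corpus: list[str]) -> str:
    """Toy retriever: return the document sharing the most words with the query."""
    return max(corpus, key=lambda doc: len(tokens(query) & tokens(doc)))


def generate(query: str, context: str) -> str:
    """Stub for the LLM call (assumption): it simply echoes the retrieved context."""
    return f"According to the records: {context}"


def run_rag(query: str, docs: list[str], placement: str) -> tuple[str, str]:
    """Run the toy pipeline; return (context seen by the generator, final answer)."""
    if placement not in ("dataset", "answer"):
        raise ValueError("placement must be 'dataset' or 'answer'")
    # Placement 1: anonymize the dataset before it is indexed and retrieved.
    corpus = [anonymize(d) for d in docs] if placement == "dataset" else list(docs)
    context = retrieve(query, corpus)
    answer = generate(query, context)
    # Placement 2: anonymize only the generated answer.
    if placement == "answer":
        answer = anonymize(answer)
    return context, answer


# Made-up example data, not taken from the paper's datasets.
DOCS = [
    "Alice can be reached at alice@example.com or 555-010-0199 about the budget review.",
    "The quarterly budget meeting moved to Friday.",
]
QUERY = "How can I reach Alice about the budget review?"

for placement in ("dataset", "answer"):
    seen, final = run_rag(QUERY, DOCS, placement)
    print(placement, "| generator saw:", seen)
    print(placement, "| user received:", final)
```

In this sketch both placements hide the e-mail address and phone number from the end user, but only dataset-level anonymization also hides them from the generator. The toy does not measure utility, which is the other side of the trade-off the abstract describes.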
Venue: IWSPA ’26, June 23–25, 2026, Frankfurt am Main, Germany