A Systematic Exploration of Text Decomposition and Budget Distribution in Differentially Private Text Obfuscation
Pith reviewed 2026-05-09 18:53 UTC · model grok-4.3
The pith
How a text is decomposed and how the privacy budget is allocated across its pieces significantly affect the privacy-utility outcome of differentially private text obfuscation, even at comparable total ε.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experiments reveal that such design choices are very important, as even with comparable privacy budgets, significantly different results can occur based on which methods are chosen. In this, we provide credible evidence of the feasibility of maximizing empirical trade-offs by optimizing DP obfuscation procedures.
What carries the argument
Techniques for decomposing input texts into component pieces combined with methods for distributing an overall ε privacy budget among those pieces prior to applying differentially private perturbations.
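To make the mechanism above concrete, here is a minimal sketch of one decomposition step and two budget allocation rules. It assumes basic sequential composition, so the per-chunk budgets add up to the overall ε. The sentence splitter and the function names (`split_sentences`, `uniform_allocation`, `length_proportional_allocation`) are illustrative assumptions, not the methods evaluated in the paper, and the perturbation step itself is omitted.

```python
import re


def split_sentences(text):
    # Naive decomposition (assumption): split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def uniform_allocation(chunks, epsilon):
    # Every chunk receives the same share of the total budget.
    share = epsilon / len(chunks)
    return [share] * len(chunks)


def length_proportional_allocation(chunks, epsilon):
    # Longer chunks (by whitespace token count) receive a larger share.
    lengths = [len(c.split()) for c in chunks]
    total = sum(lengths)
    return [epsilon * n / total for n in lengths]


text = "Alice met Bob in Paris. They talked. She flew home on Monday morning."
chunks = split_sentences(text)
uniform = uniform_allocation(chunks, 9.0)
proportional = length_proportional_allocation(chunks, 9.0)

for chunk, eps_u, eps_p in zip(chunks, uniform, proportional):
    print(f"{eps_u:.2f}  {eps_p:.2f}  {chunk}")
```

Both rules spend the same total ε, yet the shortest sentence receives far less budget under the length-proportional rule than under the uniform one. This is the kind of design difference the paper reports as affecting results at comparable budgets.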
Load-bearing premise
That the chosen evaluation metrics and datasets adequately capture real-world privacy leakage and downstream utility, and that the tested decomposition and allocation techniques are representative of practical use cases.
What would settle it
An experiment on standard benchmarks where all tested combinations of text decomposition and budget allocation produce identical privacy-utility curves would falsify the claim that these design choices are very important.
Original abstract
The goal of differentially private text obfuscation is to obfuscate, or "perturb", input texts with Differential Privacy (DP) guarantees, such that the private output texts are quantifiably indistinguishable from the originals. While perturbation at the word level is intuitive, meaningful text privatization happens on complete documents. Recent research has laid the groundwork for reasoning about privacy budget distribution, namely, how an overall $\varepsilon$ budget can be sensibly distributed among the component pieces of a text. We perform a systematic evaluation of multiple text decomposition and budget distribution techniques in the context of DP text obfuscation, testing how different methods for chunking texts can be combined with techniques for allocating $\varepsilon$ to these chunks. Our experiments reveal that such design choices are very important, as even with comparable privacy budgets, significantly different results can occur based on which methods are chosen. In this, we provide credible evidence of the feasibility of maximizing empirical trade-offs by optimizing DP obfuscation procedures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in differentially private text obfuscation, the choice of text decomposition (chunking) method and of privacy budget allocation technique substantially affects the empirical privacy-utility trade-off, even when the total ε budget is held constant. Based on a systematic comparison of multiple chunking and allocation strategies, it argues that these design decisions matter and that optimizing them yields better observed trade-offs.
Significance. If the empirical results are robust, the work would usefully demonstrate that DP text obfuscation is sensitive to decomposition and allocation design, moving the field beyond word-level perturbations toward document-level reasoning. It provides concrete evidence that empirical optimization of these procedures is feasible and can improve trade-offs, which could inform practical implementations in privacy-preserving NLP. The contribution is primarily empirical and would be strengthened by more rigorous privacy and utility validation.
major comments (2)
- Experimental evaluation: the central claim that design choices produce 'significantly different results' even at comparable ε rests on experiments whose metrics (utility scores and ε-composition) are described at a high level. Without reported membership-inference, attribute-inference, or reconstruction attack results, or tests on diverse downstream tasks and out-of-distribution data, it remains unclear whether the observed differences reflect practically meaningful privacy leakage or utility gains rather than artifacts of the chosen proxies.
- Results reporting: the manuscript asserts that 'significantly different results can occur' but provides no quantitative effect sizes, statistical significance tests, confidence intervals, or full data-split details to support the magnitude and reliability of these differences. This weakens the evidence for the feasibility of 'maximizing empirical trade-offs by optimizing DP obfuscation procedures.'
minor comments (1)
- Abstract: the description of the specific decomposition and allocation techniques evaluated could be more precise to allow readers to immediately understand the scope of the systematic comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address the major comments below and have made revisions to improve the clarity and rigor of our experimental evaluation and results reporting.
-
Referee: Experimental evaluation: the central claim that design choices produce 'significantly different results' even at comparable ε rests on experiments whose metrics (utility scores and ε-composition) are described at a high level. Without reported membership-inference, attribute-inference, or reconstruction attack results, or tests on diverse downstream tasks and out-of-distribution data, it remains unclear whether the observed differences reflect practically meaningful privacy leakage or utility gains rather than artifacts of the chosen proxies.
Authors: We agree that empirical attack-based evaluations would provide stronger evidence of practical privacy guarantees. Our manuscript focuses on the impact of text decomposition and budget allocation strategies on the privacy-utility trade-off using standard DP composition for privacy and established utility metrics such as semantic similarity and downstream task performance. The central contribution is to show that these design choices lead to different observed trade-offs even under the same total ε, which is a valid empirical observation independent of specific attack models. We have revised the manuscript to include more detailed descriptions of the metrics used and added a limitations section discussing the use of proxy metrics versus attack-based evaluations. Comprehensive attack experiments on diverse tasks are left for future work as they would require a separate study. revision: partial
-
Referee: Results reporting: the manuscript asserts that 'significantly different results can occur' but provides no quantitative effect sizes, statistical significance tests, confidence intervals, or full data-split details to support the magnitude and reliability of these differences. This weakens the evidence for the feasibility of 'maximizing empirical trade-offs by optimizing DP obfuscation procedures.'
Authors: We appreciate this point and have updated the results section to include quantitative effect sizes (e.g., relative improvements in utility at fixed ε), statistical significance tests (paired t-tests with p-values), and confidence intervals where applicable. We have also added details on the data splits used in our experiments, including the number of samples and cross-validation procedures. These additions strengthen the support for our claims regarding the importance of optimizing decomposition and allocation methods. revision: yes
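As an illustration of the kind of reporting described in the response above, the following sketch computes a paired mean difference, a paired t statistic, and a paired effect size (Cohen's d on the differences) for two configurations evaluated on the same items. The score lists are made-up placeholders, not results from the paper, and `paired_summary` is a hypothetical helper. Turning the t statistic into a p-value would additionally require the Student t distribution with n - 1 degrees of freedom, which this sketch leaves out.

```python
import math
import statistics


def paired_summary(scores_a, scores_b):
    # Per-item differences between two configurations scored on the same items.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))  # paired t statistic
    cohens_d = mean_diff / sd_diff  # effect size on the paired differences
    return {"n": n, "mean_diff": mean_diff, "t": t_stat, "d": cohens_d}


# Placeholder utility scores at a fixed total epsilon (illustrative only).
config_a = [0.80, 0.70, 0.90, 0.60]
config_b = [0.70, 0.65, 0.80, 0.55]

summary = paired_summary(config_a, config_b)
print(summary)
```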
Circularity Check
No circularity: purely empirical comparison without derivation or self-referential reduction
The paper conducts a systematic empirical evaluation of text decomposition and budget-distribution techniques for DP text obfuscation, comparing methods via experiments on utility and privacy metrics. No mathematical derivation, first-principles result, or predictive claim is advanced that reduces by construction to fitted inputs, self-definitions, or self-citation chains. Central claims rest on observed experimental differences at comparable ε budgets, which are independent of any internal fitting or renaming. This matches the default expectation for non-circular empirical work; any self-citations (e.g., to prior DP obfuscation groundwork) are non-load-bearing background and do not substitute for the reported results.