How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
Pith reviewed 2026-05-13 03:12 UTC · model grok-4.3
The pith
Differential privacy reduces social bias in LLM sentence scoring, but the reduction does not carry over to other tasks or bias measures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training an LLM with DP-SGD reduces bias when measured through controlled likelihood comparisons in sentence scoring tasks, but this improvement does not generalize to text completion, tabular classification, or question answering. A discrepancy exists between logit-level bias and output-level bias. Decreasing memorization through differential privacy does not necessarily reduce unfairness, which shows that multi-paradigm evaluation is required to assess fairness in LLMs.
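The DP-SGD procedure at the heart of the comparison can be sketched as per-example gradient clipping followed by calibrated Gaussian noise, as in Abadi et al. [1]. This is a minimal illustration of the update rule on toy gradients, not the paper's training code; the clipping norm and noise multiplier below are illustrative choices.

```python
import math
import random

def clip(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_step(per_example_grads, max_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD update direction: clip each per-example gradient,
    sum them, add Gaussian noise scaled to the clipping norm, average."""
    rng = rng or random.Random(0)
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]

# Toy batch of two per-example gradients; the second exceeds the clip norm.
grads = [[0.3, 0.4], [3.0, 4.0]]
update = dp_sgd_step(grads, max_norm=1.0, noise_multiplier=0.0)
# With noise_multiplier=0 this is just the average of clipped gradients:
# clip([3, 4]) -> [0.6, 0.8], so update == [0.45, 0.6].
```

The clipping bound is what limits any single example's influence on the update; the noise multiplier then determines the privacy budget spent per step.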
What carries the argument
Four-paradigm bias evaluation that contrasts DP-SGD models against non-private baselines, with sentence scoring performed via likelihood ratios and the other three paradigms measured on generated outputs.
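The sentence-scoring paradigm can be made concrete with a CrowS-Pairs-style metric [25]: for each minimal pair, compare the model's log-likelihood of the stereotyping sentence against its anti-stereotyping counterpart, and report the fraction of pairs where the stereotype is preferred (0.5 is the unbiased point). The log-likelihoods below are toy numbers standing in for real model scores.

```python
def bias_score(pairs):
    """Fraction of minimal pairs where the model assigns a higher
    log-likelihood to the stereotyping sentence. 0.5 is unbiased."""
    preferred = sum(1 for ll_stereo, ll_anti in pairs if ll_stereo > ll_anti)
    return preferred / len(pairs)

# Toy (log-likelihood_stereotype, log-likelihood_antistereotype) pairs,
# standing in for sums of token log-probs from a real model.
pairs = [(-12.1, -13.0), (-8.4, -8.2), (-20.5, -21.1), (-15.0, -14.9)]
score = bias_score(pairs)  # 2 of 4 pairs prefer the stereotype -> 0.5
```

This is a logit-level measure: it never samples text, which is exactly why it can diverge from bias observed in generated outputs.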
If this is right
- Fairness checks for private LLMs must use several complementary measurement methods rather than any single paradigm.
- Reducing memorization via DP does not substitute for explicit fairness interventions.
- Logit-based bias scores can diverge from the bias that appears in the model's actual text outputs.
- Privacy training can improve one class of bias metrics without additional changes to the model.
Where Pith is reading between the lines
- Separate debiasing steps may still be needed even when differential privacy is already applied for data protection.
- The apparent benefit of DP for bias depends heavily on which measurement method is chosen, so results from one paradigm alone can mislead.
- Deployments that care about both privacy and fairness should test the specific combination of DP strength and bias metric relevant to their use case.
- The independence of memorization reduction and bias reduction suggests these two goals may require distinct training adjustments.
Load-bearing premise
The four chosen bias paradigms are sufficient to detect whether differential privacy affects social bias in general.
What would settle it
Running the same DP versus non-DP comparison on a fifth independent bias paradigm and finding either uniform reduction or an increase in bias would directly test whether the observed task dependence is general.
Original abstract
Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that training LLMs with differential privacy (DP-SGD) reduces social bias relative to non-DP baselines in sentence-scoring tasks when bias is measured via controlled likelihood comparisons, but that this reduction does not generalize to text-completion, tabular-classification, or question-answering paradigms. It further reports a discrepancy between logit-level and output-level bias measures and concludes that reduced memorization does not necessarily reduce unfairness, underscoring the value of multi-paradigm evaluation.
Significance. If the quantitative results and statistical controls hold, the work would provide useful empirical evidence on the privacy-fairness interplay in LLMs and would strengthen the case for evaluating fairness across multiple output regimes rather than relying on any single paradigm. The emphasis on the logit-versus-output discrepancy and the memorization-unfairness dissociation are potentially actionable for practitioners choosing DP training regimes.
major comments (2)
- [Abstract] Abstract: the central claims are stated only in directional terms ('DP reduces bias... yet this improvement does not generalize') with no reported effect sizes, confidence intervals, statistical tests, model scales, dataset sizes, or precise definitions of the four bias metrics. This absence prevents assessment of whether the reported discrepancy is robust or practically meaningful.
- [Evaluation Paradigms] Evaluation Paradigms section (or equivalent): the non-generalization conclusion rests on the implicit premise that the four chosen paradigms are jointly sufficient to detect bias effects. The manuscript supplies no coverage argument, no comparison to other standard probes (e.g., WEAT-style association tests, BOLD, or toxicity-prompt suites), and no analysis of whether logit-level versus output-level discrepancies would be expected to appear uniformly across embedding-based or open-ended generation metrics. If additional standard probes were to show DP-induced bias reduction, the headline claim would be weakened.
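For concreteness, the WEAT-style association test named above reduces to a cosine-similarity effect size over word embeddings (Caliskan et al. [8]). A minimal sketch with toy 2-d vectors; all vectors and set contents are illustrative, not data from the paper.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assoc(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return sum(cos(w, a) for a in A) / len(A) - sum(cos(w, b) for b in B) / len(B)

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: difference of mean associations of target sets
    X and Y, normalized by the std of associations over X union Y."""
    sx = [assoc(w, A, B) for w in X]
    sy = [assoc(w, A, B) for w in Y]
    s_all = sx + sy
    mean_all = sum(s_all) / len(s_all)
    std = math.sqrt(sum((s - mean_all) ** 2 for s in s_all) / (len(s_all) - 1))
    return (sum(sx) / len(sx) - sum(sy) / len(sy)) / std

# Toy "embeddings": target set X aligns with attribute set A, Y with B,
# so the effect size comes out positive (an association in X's favor).
X = [[1.0, 0.1], [0.9, 0.0]]
Y = [[0.1, 1.0], [0.0, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
es = weat_effect_size(X, Y, A, B)  # positive for this construction
```

A probe of this embedding-association family would complement the four output- and logit-level paradigms the paper evaluates.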
minor comments (2)
- [Abstract] Abstract: adding one sentence summarizing the concrete models, training datasets, and bias-metric implementations would immediately improve readability and allow readers to gauge scope.
- [Experimental Setup] The manuscript should clarify whether the non-DP baselines are matched for compute, data, and hyper-parameters or whether any observed differences could be confounded by training regime disparities.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to improve the precision and justification in our presentation. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
-
Referee: [Abstract] Abstract: the central claims are stated only in directional terms ('DP reduces bias... yet this improvement does not generalize') with no reported effect sizes, confidence intervals, statistical tests, model scales, dataset sizes, or precise definitions of the four bias metrics. This absence prevents assessment of whether the reported discrepancy is robust or practically meaningful.
Authors: We agree that the abstract would be strengthened by greater quantitative specificity. In the revised manuscript we will incorporate key effect sizes (e.g., the observed bias-score reductions in sentence-scoring tasks), reference the model scales and training dataset sizes employed, and supply concise definitions of the four bias metrics. Statistical significance tests and confidence intervals already appear in the main results; we will briefly reference them in the abstract where length permits. These changes will make the directional claims more informative while preserving the abstract's high-level character. revision: yes
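The confidence intervals promised in this response can be produced with a simple nonparametric bootstrap over per-pair bias outcomes. A sketch under that assumption, with toy indicator data in place of the paper's actual scores:

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of `values`."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(values) for _ in values]
        stats.append(stat(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-pair indicators: 1 if the model preferred the stereotype, else 0.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
lo, hi = bootstrap_ci(outcomes)
# The interval brackets the observed preference rate (0.6 here).
```

Reporting such an interval alongside each paradigm's point estimate would let readers judge whether the DP-versus-non-DP gaps are robust.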
-
Referee: [Evaluation Paradigms] Evaluation Paradigms section (or equivalent): the non-generalization conclusion rests on the implicit premise that the four chosen paradigms are jointly sufficient to detect bias effects. The manuscript supplies no coverage argument, no comparison to other standard probes (e.g., WEAT-style association tests, BOLD, or toxicity-prompt suites), and no analysis of whether logit-level versus output-level discrepancies would be expected to appear uniformly across embedding-based or open-ended generation metrics. If additional standard probes were to show DP-induced bias reduction, the headline claim would be weakened.
Authors: We will add an explicit coverage argument and expanded discussion to the Evaluation Paradigms section. The revision will justify the selection of the four paradigms as complementary probes spanning controlled likelihood comparisons, open-ended generation, tabular fairness, and question-answering fairness. We will relate these to other standard probes (WEAT, BOLD, toxicity suites), noting that our focus is on output-level and task-specific unfairness rather than purely representational associations. We will also discuss the paradigm-dependent nature of the logit-versus-output discrepancy observed in our results. While we cannot exhaustively test every possible probe, the empirical dissociation we report already demonstrates that bias reduction does not hold uniformly across the evaluated regimes, reinforcing the value of multi-paradigm assessment. The headline claim will be scoped accordingly. revision: partial
Circularity Check
Empirical evaluation with no derivation chain
full rationale
The paper reports direct empirical comparisons of bias metrics (likelihood ratios, completion probabilities, classification disparities, QA accuracy gaps) between a DP-SGD model and non-DP baselines across four fixed paradigms. No equations, parameters, or predictions are defined in terms of the target bias quantities; all reported differences are computed from model forward passes on held-out test sets. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the measurement pipeline or to force the observed non-generalization result. The work is therefore self-contained against external benchmarks and contains no circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Standard assumptions of differential privacy (bounded sensitivity, proper noise calibration) hold for the DP-SGD training procedure used.
- domain assumption The chosen bias metrics (likelihood ratios, output distributions, classification fairness) capture the intended notion of social bias.
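As one concrete member of the classification-fairness family the ledger refers to, the demographic parity gap compares positive-prediction rates across demographic groups. A minimal sketch; the group labels and predictions are toy data, not the paper's tabular benchmark.

```python
def demographic_parity_gap(preds, groups):
    """Largest absolute difference in positive-prediction rate between
    groups. 0.0 means all groups receive positives at the same rate."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

# Toy binary predictions for two demographic groups "a" and "b".
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)  # rates 0.75 vs 0.25 -> 0.5
```

Whether a metric like this "captures the intended notion of social bias" is exactly the second domain assumption above: parity in prediction rates is one operationalization among several, and it can disagree with error-rate-based criteria such as equalized odds [20].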
Reference graph
Works this paper leans on
-
[1]
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security
work page 2016
-
[2]
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias: There's software used across the country to predict future criminals. and it's biased against blacks. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
work page 2016
-
[3]
Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. 2019. Differential privacy has disparate impact on model accuracy. Advances in neural information processing systems, 32
work page 2019
-
[4]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT)
work page 2021
-
[5]
Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. 2020. Fairlearn: A toolkit for assessing and improving fairness in AI
work page 2020
-
[6]
Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29
work page 2016
-
[7]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901
work page 2020
-
[8]
Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183--186
work page 2017
-
[9]
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations
work page 2022
-
[10]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, and 1 others. 2021. Extracting training data from large language models. USENIX Security Symposium
work page 2021
-
[11]
Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Andres Bertran, Michael Kearns, and James Zou. 2025. Improving llm group fairness on tabular data via in-context learning. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
work page 2025
-
[12]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL
work page 2019
-
[13]
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 862--872
work page 2021
-
[14]
Dheeru Dua, Casey Graff, and 1 others. 2017. UCI machine learning repository. URL http://archive.ics.uci.edu/ml
work page 2017
-
[15]
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214--226
work page 2012
-
[16]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265--284. Springer
work page 2006
- [17]
-
[18]
Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational linguistics, 50(3):1097--1179
work page 2024
-
[19]
Laura Hanu and Unitary team. 2020. Detoxify. GitHub. https://github.com/unitaryai/detoxify
work page 2020
-
[20]
Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29
work page 2016
-
[21]
Jingyu Hu, Weiru Liu, and Mengnan Du. 2024. Strategic demonstration selection for improved fairness in llm in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7460--7475
work page 2024
-
[22]
Clayton Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pages 216--225
work page 2014
- [23]
-
[24]
Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 5356--5371
work page 2021
-
[25]
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel Bowman. 2020. Crows-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 1953--1967
work page 2020
-
[26]
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. 2022. Bbq: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086--2105
work page 2022
-
[27]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and 1 others. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024--8035
work page 2019
-
[28]
Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. Tabicl: A tabular foundation model for in-context learning on large data. In Proceedings of the 42nd International Conference on Machine Learning
work page 2025
-
[29]
Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3407--3412
work page 2019
- [30]
-
[31]
Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. 2022. "i'm sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 9180--9211
work page 2022
- [32]
-
[33]
Gemma Team. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786
work page 2025
-
[34]
Gemma Team, Morgane Riviere, Shreya Pathak, and 1 others. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118
work page 2024
-
[35]
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. In ACL
work page 2019
-
[36]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and 1 others. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38--45
work page 2020
-
[37]
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15--20
work page 2018