How Does Differential Privacy Affect Social Bias in LLMs? A Systematic Evaluation
Pith reviewed 2026-05-13 03:12 UTC · model grok-4.3
The pith
Differential privacy reduces social bias in LLM sentence scoring, but the reduction does not carry over to other tasks or bias measures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training an LLM with DP-SGD reduces bias when measured through controlled likelihood comparisons in sentence scoring tasks, but this improvement does not generalize to text completion, tabular classification, or question answering. A discrepancy exists between logit-level bias and output-level bias. Decreasing memorization through differential privacy does not necessarily reduce unfairness, which shows that multi-paradigm evaluation is required to assess fairness in LLMs.
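The DP-SGD procedure at the heart of the comparison can be sketched as per-example gradient clipping followed by calibrated Gaussian noise, as in Abadi et al. [1]. This is a minimal illustration of the update rule on toy gradients, not the paper's training code; the clipping norm and noise multiplier below are illustrative choices.

```python
import math
import random

def clip(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_step(per_example_grads, max_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD update direction: clip each per-example gradient,
    sum them, add Gaussian noise scaled to the clipping norm, average."""
    rng = rng or random.Random(0)
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            summed[i] += g
    sigma = noise_multiplier * max_norm
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]

# Toy batch of two per-example gradients; the second exceeds the clip norm.
grads = [[0.3, 0.4], [3.0, 4.0]]
update = dp_sgd_step(grads, max_norm=1.0, noise_multiplier=0.0)
# With noise_multiplier=0 this is just the average of clipped gradients:
# clip([3, 4]) -> [0.6, 0.8], so update == [0.45, 0.6].
```

The clipping bound is what limits any single example's influence on the update; the noise multiplier then determines the privacy budget spent per step.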
What carries the argument
Four-paradigm bias evaluation that contrasts DP-SGD models against non-private baselines, with sentence scoring performed via likelihood ratios and the other three paradigms measured on generated outputs.
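The sentence-scoring paradigm can be made concrete with a CrowS-Pairs-style metric [25]: for each minimal pair, compare the model's log-likelihood of the stereotyping sentence against its anti-stereotyping counterpart, and report the fraction of pairs where the stereotype is preferred (0.5 is the unbiased point). The log-likelihoods below are toy numbers standing in for real model scores.

```python
def bias_score(pairs):
    """Fraction of minimal pairs where the model assigns a higher
    log-likelihood to the stereotyping sentence. 0.5 is unbiased."""
    preferred = sum(1 for ll_stereo, ll_anti in pairs if ll_stereo > ll_anti)
    return preferred / len(pairs)

# Toy (log-likelihood_stereotype, log-likelihood_antistereotype) pairs,
# standing in for sums of token log-probs from a real model.
pairs = [(-12.1, -13.0), (-8.4, -8.2), (-20.5, -21.1), (-15.0, -14.9)]
score = bias_score(pairs)  # 2 of 4 pairs prefer the stereotype -> 0.5
```

This is a logit-level measure: it never samples text, which is exactly why it can diverge from bias observed in generated outputs.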
If this is right
- Fairness checks for private LLMs must use several complementary measurement methods rather than any single paradigm.
- Reducing memorization via DP does not substitute for explicit fairness interventions.
- Logit-based bias scores can diverge from the bias that appears in the model's actual text outputs.
- Privacy training can improve one class of bias metrics without additional changes to the model.
Where Pith is reading between the lines
- Separate debiasing steps may still be needed even when differential privacy is already applied for data protection.
- The apparent benefit of DP for bias depends heavily on which measurement method is chosen, so results from one paradigm alone can mislead.
- Deployments that care about both privacy and fairness should test the specific combination of DP strength and bias metric relevant to their use case.
- The independence of memorization reduction and bias reduction suggests these two goals may require distinct training adjustments.
Load-bearing premise
The four chosen bias paradigms are sufficient to detect whether differential privacy affects social bias in general.
What would settle it
Running the same DP versus non-DP comparison on a fifth independent bias paradigm and finding either uniform reduction or an increase in bias would directly test whether the observed task dependence is general.
Original abstract
Large language models (LLMs) trained on web-scale corpora can memorize sensitive training data, posing significant privacy risks. Differential privacy (DP) has emerged as a principled framework that limits the influence of individual data points during training, yet the relationship between differential privacy and social bias in LLMs remains poorly understood. To investigate this, we present a systematic evaluation of social bias in a pretrained LLM trained with DP-SGD, comparing a DP model against non-DP baselines across four complementary paradigms: sentence scoring, text completion, tabular classification, and question answering. We find that DP reduces bias in sentence scoring tasks, where bias is measured through controlled likelihood comparisons, yet this improvement does not generalize across all tasks. Our results reveal a discrepancy between logit-level bias and output-level bias. Moreover, decreasing memorization does not necessarily reduce unfairness, underscoring the importance of multi-paradigm evaluation when assessing fairness in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that training LLMs with differential privacy (DP-SGD) reduces social bias relative to non-DP baselines in sentence-scoring tasks when bias is measured via controlled likelihood comparisons, but that this reduction does not generalize to text-completion, tabular-classification, or question-answering paradigms. It further reports a discrepancy between logit-level and output-level bias measures and concludes that reduced memorization does not necessarily reduce unfairness, underscoring the value of multi-paradigm evaluation.
Significance. If the quantitative results and statistical controls hold, the work would provide useful empirical evidence on the privacy-fairness interplay in LLMs and would strengthen the case for evaluating fairness across multiple output regimes rather than relying on any single paradigm. The emphasis on the logit-versus-output discrepancy and the memorization-unfairness dissociation are potentially actionable for practitioners choosing DP training regimes.
major comments (2)
- [Abstract] Abstract: the central claims are stated only in directional terms ('DP reduces bias... yet this improvement does not generalize') with no reported effect sizes, confidence intervals, statistical tests, model scales, dataset sizes, or precise definitions of the four bias metrics. This absence prevents assessment of whether the reported discrepancy is robust or practically meaningful.
- [Evaluation Paradigms] Evaluation Paradigms section (or equivalent): the non-generalization conclusion rests on the implicit premise that the four chosen paradigms are jointly sufficient to detect bias effects. The manuscript supplies no coverage argument, no comparison to other standard probes (e.g., WEAT-style association tests, BOLD, or toxicity-prompt suites), and no analysis of whether logit-level versus output-level discrepancies would be expected to appear uniformly across embedding-based or open-ended generation metrics. If additional standard probes were to show DP-induced bias reduction, the headline claim would be weakened.
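For concreteness, the WEAT-style association test named above reduces to a cosine-similarity effect size over word embeddings (Caliskan et al. [8]). A minimal sketch with toy 2-d vectors; all vectors and set contents are illustrative, not data from the paper.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assoc(w, A, B):
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return sum(cos(w, a) for a in A) / len(A) - sum(cos(w, b) for b in B) / len(B)

def weat_effect_size(X, Y, A, B):
    """WEAT effect size: difference of mean associations of target sets
    X and Y, normalized by the std of associations over X union Y."""
    sx = [assoc(w, A, B) for w in X]
    sy = [assoc(w, A, B) for w in Y]
    s_all = sx + sy
    mean_all = sum(s_all) / len(s_all)
    std = math.sqrt(sum((s - mean_all) ** 2 for s in s_all) / (len(s_all) - 1))
    return (sum(sx) / len(sx) - sum(sy) / len(sy)) / std

# Toy "embeddings": target set X aligns with attribute set A, Y with B,
# so the effect size comes out positive (an association in X's favor).
X = [[1.0, 0.1], [0.9, 0.0]]
Y = [[0.1, 1.0], [0.0, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
es = weat_effect_size(X, Y, A, B)  # positive for this construction
```

A probe of this embedding-association family would complement the four output- and logit-level paradigms the paper evaluates.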
minor comments (2)
- [Abstract] Abstract: adding one sentence summarizing the concrete models, training datasets, and bias-metric implementations would immediately improve readability and allow readers to gauge scope.
- [Experimental Setup] The manuscript should clarify whether the non-DP baselines are matched for compute, data, and hyper-parameters or whether any observed differences could be confounded by training regime disparities.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to improve the precision and justification in our presentation. We respond to each major comment below and indicate the revisions we will make.
Point-by-point responses
-
Referee: [Abstract] Abstract: the central claims are stated only in directional terms ('DP reduces bias... yet this improvement does not generalize') with no reported effect sizes, confidence intervals, statistical tests, model scales, dataset sizes, or precise definitions of the four bias metrics. This absence prevents assessment of whether the reported discrepancy is robust or practically meaningful.
Authors: We agree that the abstract would be strengthened by greater quantitative specificity. In the revised manuscript we will incorporate key effect sizes (e.g., the observed bias-score reductions in sentence-scoring tasks), reference the model scales and training dataset sizes employed, and supply concise definitions of the four bias metrics. Statistical significance tests and confidence intervals already appear in the main results; we will briefly reference them in the abstract where length permits. These changes will make the directional claims more informative while preserving the abstract's high-level character. revision: yes
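The confidence intervals promised in this response can be produced with a simple nonparametric bootstrap over per-pair bias outcomes. A sketch under that assumption, with toy indicator data in place of the paper's actual scores:

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of `values`."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(values) for _ in values]
        stats.append(stat(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-pair indicators: 1 if the model preferred the stereotype, else 0.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1]
lo, hi = bootstrap_ci(outcomes)
# The interval brackets the observed preference rate (0.6 here).
```

Reporting such an interval alongside each paradigm's point estimate would let readers judge whether the DP-versus-non-DP gaps are robust.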
-
Referee: [Evaluation Paradigms] Evaluation Paradigms section (or equivalent): the non-generalization conclusion rests on the implicit premise that the four chosen paradigms are jointly sufficient to detect bias effects. The manuscript supplies no coverage argument, no comparison to other standard probes (e.g., WEAT-style association tests, BOLD, or toxicity-prompt suites), and no analysis of whether logit-level versus output-level discrepancies would be expected to appear uniformly across embedding-based or open-ended generation metrics. If additional standard probes were to show DP-induced bias reduction, the headline claim would be weakened.
Authors: We will add an explicit coverage argument and expanded discussion to the Evaluation Paradigms section. The revision will justify the selection of the four paradigms as complementary probes spanning controlled likelihood comparisons, open-ended generation, tabular fairness, and question-answering fairness. We will relate these to other standard probes (WEAT, BOLD, toxicity suites), noting that our focus is on output-level and task-specific unfairness rather than purely representational associations. We will also discuss the paradigm-dependent nature of the logit-versus-output discrepancy observed in our results. While we cannot exhaustively test every possible probe, the empirical dissociation we report already demonstrates that bias reduction does not hold uniformly across the evaluated regimes, reinforcing the value of multi-paradigm assessment. The headline claim will be scoped accordingly. revision: partial
Circularity Check
Empirical evaluation with no derivation chain
full rationale
The paper reports direct empirical comparisons of bias metrics (likelihood ratios, completion probabilities, classification disparities, QA accuracy gaps) between a DP-SGD model and non-DP baselines across four fixed paradigms. No equations, parameters, or predictions are defined in terms of the target bias quantities; all reported differences are computed from model forward passes on held-out test sets. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the measurement pipeline or to force the observed non-generalization result. The work is therefore self-contained against external benchmarks and contains no circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Standard assumptions of differential privacy (bounded sensitivity, proper noise calibration) hold for the DP-SGD training procedure used.
- domain assumption The chosen bias metrics (likelihood ratios, output distributions, classification fairness) capture the intended notion of social bias.
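As one concrete member of the classification-fairness family the ledger refers to, the demographic parity gap compares positive-prediction rates across demographic groups. A minimal sketch; the group labels and predictions are toy data, not the paper's tabular benchmark.

```python
def demographic_parity_gap(preds, groups):
    """Largest absolute difference in positive-prediction rate between
    groups. 0.0 means all groups receive positives at the same rate."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

# Toy binary predictions for two demographic groups "a" and "b".
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)  # rates 0.75 vs 0.25 -> 0.5
```

Whether a metric like this "captures the intended notion of social bias" is exactly the second domain assumption above: parity in prediction rates is one operationalization among several, and it can disagree with error-rate-based criteria such as equalized odds [20].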
Reference graph
Works this paper leans on
-
[1]
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security
work page 2016
-
[2]
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias: There's software used across the country to predict future criminals. and it's biased against blacks. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
work page 2016
-
[3]
Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. 2019. Differential privacy has disparate impact on model accuracy. Advances in neural information processing systems, 32
work page 2019
-
[4]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT)
work page 2021
-
[5]
Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, and Kathleen Walker. 2020. Fairlearn: A toolkit for assessing and improving fairness in AI
work page 2020
-
[6]
Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29
work page 2016
-
[7]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901
work page 2020
-
[8]
Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183--186
work page 2017
-
[9]
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations
work page 2022
-
[10]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, and 1 others. 2021. Extracting training data from large language models. USENIX Security Symposium
work page 2021
-
[11]
Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Andres Bertran, Michael Kearns, and James Zou. 2025. Improving llm group fairness on tabular data via in-context learning. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
work page 2025
-
[12]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL
work page 2019
-
[13]
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 862--872
work page 2021
-
[14]
Dheeru Dua, Casey Graff, and 1 others. 2017. UCI machine learning repository. URL http://archive.ics.uci.edu/ml
work page 2017
-
[15]
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214--226
work page 2012
-
[16]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265--284. Springer
work page 2006
- [17]
-
[18]
Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational linguistics, 50(3):1097--1179
work page 2024
-
[19]
Laura Hanu and Unitary team. 2020. Detoxify. GitHub. https://github.com/unitaryai/detoxify
work page 2020
-
[20]
Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29
work page 2016
-
[21]
Jingyu Hu, Weiru Liu, and Mengnan Du. 2024. Strategic demonstration selection for improved fairness in llm in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7460--7475
work page 2024
-
[22]
Clayton Hutto and Eric Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pages 216--225
work page 2014
- [23]
-
[24]
Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 5356--5371
work page 2021
-
[25]
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel Bowman. 2020. Crows-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 1953--1967
work page 2020
-
[26]
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. 2022. Bbq: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086--2105
work page 2022
-
[27]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and 1 others. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024--8035
work page 2019
-
[28]
Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. Tabicl: A tabular foundation model for in-context learning on large data. In Proceedings of the 42nd International Conference on Machine Learning
work page 2025
-
[29]
Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3407--3412
work page 2019
- [30]
-
[31]
Eric Michael Smith, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. 2022. "i'm sorry to hear that": Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 9180--9211
work page 2022
- [32]
-
[33]
Gemma Team. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786
work page 2025
-
[34]
Gemma Team, Morgane Riviere, Shreya Pathak, and 1 others. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118
work page 2024
-
[35]
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. In ACL
work page 2019
-
[36]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and 1 others. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38--45
work page 2020
-
[37]
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15--20
work page 2018