pith. sign in

arxiv: 2606.12088 · v1 · pith:HW6KS7EQnew · submitted 2026-06-10 · 💻 cs.CL

Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

Pith reviewed 2026-06-27 09:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords fairness in NLPdebiasing without protected attributesconcept erasureself-description textlanguage modelshelpfulness predictionimplicit signalsStack Exchange benchmark
0
0 comments X

The pith

Debiasing language models works using self-description text in place of explicit protected attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fairness interventions in NLP can proceed without direct access to sensitive attributes such as gender or race. It introduces H-SAL to erase biased concepts post-hoc by treating self-description text as the debiasing signal instead. A new Stack Exchange benchmark for helpfulness prediction supplies both explicit and implicit signals to enable direct comparison. Experiments across encoder and decoder-only models show the implicit route matches or exceeds the explicit one. This matters because many real deployments lack the protected labels that standard debiasing methods require.

Core claim

Post-hoc concept and attribute erasure can be performed using self-description text as an implicit debiasing signal, and across encoder and decoder-only language models this implicit approach often matches or outperforms explicit-label-based debiasing on a new multi-domain fairness benchmark for helpfulness prediction.

What carries the argument

H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal

If this is right

  • Debiasing remains feasible in privacy-constrained or metadata-absent settings.
  • Task performance on downstream prediction does not degrade when switching to implicit signals.
  • Both encoder-only and decoder-only models respond comparably to the implicit erasure method.
  • A multi-domain benchmark now exists for testing debiasing under realistic data constraints.
  • Representation-level fairness methods can be applied without collecting protected attributes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on other prediction tasks such as toxicity detection or resume screening where self-descriptions are available.
  • If self-descriptions prove stable over time, models could be periodically re-debiased without re-collecting labels.
  • The approach raises the question of whether implicit signals themselves encode secondary biases that explicit methods avoid.
  • Deployment pipelines could default to implicit erasure when explicit attributes are legally unavailable.

Load-bearing premise

Self-description text contains reliable implicit signals that can substitute for direct protected attributes in concept erasure without introducing new biases or losing task performance.

What would settle it

An experiment showing that H-SAL with self-descriptions produces lower fairness scores or worse helpfulness prediction accuracy than explicit-label debiasing on the Stack Exchange benchmark would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.12088 by Anna Korhonen, Shay B. Cohen, Shun Shao, Yftah Ziser, Zheng Zhao.

Figure 1
Figure 1. Figure 1: Self-description cues used for textual era￾sure. Real examples from Stack Exchange user pro￾files, where H denotes the user’s self-description. The highlighted spans mark words or phrases that may im￾plicitly reveal attributes such as education, reputation or location, and are therefore used to learn directions for textual erasure. in latent representations, and let them shape re￾sponses even without expli… view at source ↗
Figure 2
Figure 2. Figure 2: A Venn diagram showing the information content overlap between the different r.v.s. A success of H-SAL depends on how much the part of Z that overlaps with X also overlaps with H. where X is the representation of the main-task in￾put, Y is the task label, Z is the guarded attribute (Z denotes the set of possible attribute values), and H is the raw self-description text. We encode H with a text encoder ϕ in… view at source ↗
Figure 4
Figure 4. Figure 4: Plots for the bound for ρ1 where u ⊤ 3 v2 = 0.5 (left) and = 1 (right). Note that the bound can be symmetrically flipped to have ρ2 on the left, dependent on ρ1 [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Engagement bias reduction across three Stack Exchange communities under the natural-disjoint split. Each subfigure plots TPR-Gap against the number of removed directions (k) for BERT, Mistral-7B, and Llama-3.1-8B, comparing random, explicit, and implicit SAL. TPR-Gap is the generalized disparity (2× StdDev) of true positive rates across groups, equivalent to the difference in true positive rates in the bin… view at source ↗
read the original abstract

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes H-SAL, a post-hoc method for latent concept and attribute erasure that uses self-description text as an implicit debiasing signal in place of direct protected attributes (gender, race, etc.). To enable evaluation, the authors introduce a multi-domain Stack Exchange benchmark for helpfulness prediction containing both explicit labels and implicit self-descriptions. Empirical results across encoder and decoder-only LMs indicate that implicit self-description often matches or exceeds explicit-label debiasing on fairness and task metrics.

Significance. If the central empirical claim holds after verification of signal reliability, the work would meaningfully extend representation-level fairness methods to privacy-constrained regimes where protected attributes are unavailable. The new benchmark is a concrete contribution that supports controlled comparison of explicit vs. implicit debiasing. No machine-checked proofs or parameter-free derivations are present; the contribution is empirical.

major comments (2)
  1. [Benchmark and Experiments] The central claim that implicit self-description supplies signals 'sufficient' for concept erasure to match explicit-label performance (§ on experiments / results) is load-bearing, yet the manuscript provides no reported correlation coefficients, mutual information, or ablation quantifying how strongly self-descriptions align with protected attributes in the Stack Exchange data. Without this, observed parity could be an artifact of benchmark construction rather than evidence that H-SAL erases the intended concept.
  2. [Results] Table or figure reporting helpfulness-prediction results: the claim of 'often matches or outperforms' lacks error bars, statistical significance tests, or per-domain breakdowns. If variance is high or gains are driven by a subset of domains, the cross-model generalization statement is not yet supported.
minor comments (2)
  1. [Method] Notation for H-SAL components (e.g., the erasure objective) should be defined once with a single equation block rather than scattered prose references.
  2. [Experimental Setup] The abstract states results 'across encoder and decoder-only language models' but the manuscript should explicitly list the exact model sizes and fine-tuning regimes used for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the empirical support for our claims regarding implicit debiasing signals.

read point-by-point responses
  1. Referee: [Benchmark and Experiments] The central claim that implicit self-description supplies signals 'sufficient' for concept erasure to match explicit-label performance (§ on experiments / results) is load-bearing, yet the manuscript provides no reported correlation coefficients, mutual information, or ablation quantifying how strongly self-descriptions align with protected attributes in the Stack Exchange data. Without this, observed parity could be an artifact of benchmark construction rather than evidence that H-SAL erases the intended concept.

    Authors: We agree that quantifying the alignment between self-description signals and protected attributes would provide stronger evidence against the possibility of benchmark artifacts. The benchmark was constructed to include both explicit labels (for evaluation) and implicit self-descriptions (for debiasing), but we did not report alignment metrics such as mutual information or correlations in the original submission. In the revised manuscript we will add these analyses, including correlation coefficients and an ablation that measures how much of the debiasing effect is attributable to the self-description signal versus other factors. This will directly address whether the observed performance parity reflects genuine concept erasure. revision: yes

  2. Referee: [Results] Table or figure reporting helpfulness-prediction results: the claim of 'often matches or outperforms' lacks error bars, statistical significance tests, or per-domain breakdowns. If variance is high or gains are driven by a subset of domains, the cross-model generalization statement is not yet supported.

    Authors: We acknowledge that the absence of error bars, significance testing, and per-domain results limits the strength of the generalization claims. The original tables reported aggregate metrics across models and domains but did not include these details. In the revision we will augment the results section with standard error bars, paired statistical significance tests (e.g., McNemar or Wilcoxon where appropriate), and full per-domain breakdowns. These additions will allow readers to assess whether performance parity holds consistently or is driven by particular domains, thereby supporting or qualifying the cross-model statements. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on new benchmark

full rationale

The paper introduces H-SAL as a post-hoc erasure method and a Stack Exchange benchmark, then reports experimental results showing implicit self-description often matches or exceeds explicit-label debiasing across models. No equations, fitted parameters renamed as predictions, self-citation load-bearing steps, or ansatzes are present in the provided text. The central claim rests on direct empirical measurement rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; the approach rests on the domain assumption that self-descriptions encode usable proxies for protected attributes and that post-hoc erasure can isolate those concepts without task degradation.

axioms (1)
  • domain assumption Self-description text contains implicit signals for protected attributes that can be used for debiasing
    Central to H-SAL and the benchmark comparison

pith-pipeline@v0.9.1-grok · 5699 in / 1133 out tokens · 13154 ms · 2026-06-27T09:57:32.258074+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

148 extracted references · 77 canonical work pages

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    Journal of Artificial Intelligence Research , volume=

    A survey of cross-lingual word embedding models , author=. Journal of Artificial Intelligence Research , volume=

  9. [9]

    Adversarial Concept Erasure in Kernel Space

    Ravfogel, Shauli and Vargas, Francisco and Goldberg, Yoav and Cotterell, Ryan. Adversarial Concept Erasure in Kernel Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.405

  10. [10]

    Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , url =

    Bolukbasi, Tolga and Chang, Kai-Wei and Zou, James Y and Saligrama, Venkatesh and Kalai, Adam T , booktitle =. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , url =

  11. [11]

    Bryson and Arvind Narayanan , title =

    Aylin Caliskan and Joanna J. Bryson and Arvind Narayanan , title =. Science , volume =. 2017 , doi =

  12. [12]

    2021 , isbn =

    Guo, Wei and Caliskan, Aylin , title =. 2021 , isbn =. doi:10.1145/3461702.3462536 , booktitle =

  13. [13]

    Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings

    Manzini, Thomas and Yao Chong, Lim and Black, Alan W and Tsvetkov, Yulia. Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

  14. [14]

    Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment

    Veeman, Hartger and Allassonni \`e re-Tang, Marc and Berdicevskis, Aleksandrs and Basirat, Ali. Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment. Proceedings of the 24th Conference on Computational Natural Language Learning. 2020. doi:10.18653/v1/2020.conll-1.20

  15. [15]

    Analytical Methods for Interpretable Ultradense Word Embeddings

    Dufter, Philipp and Sch. Analytical Methods for Interpretable Ultradense Word Embeddings. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1111

  16. [16]

    Unsupervised Cross-lingual Representation Learning at Scale

    Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm \'a n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ...

  17. [17]

    Psychometrika , volume=

    Principal component analysis of three-mode data by means of alternating least squares algorithms , author=. Psychometrika , volume=. 1980 , publisher=

  18. [18]

    and Bader, Brett W

    Kolda, Tamara G. and Bader, Brett W. , title =. SIAM Review , volume =. 2009 , doi =

  19. [19]

    Dumais, S. T. and Furnas, G. W. and Landauer, T. K. and Deerwester, S. and Harshman, R. , title =. 1988 , isbn =. doi:10.1145/57167.57214 , booktitle =

  20. [20]

    Eckart-Young

    Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition , author=. Psychometrika , volume=. 1970 , publisher=

  21. [21]

    explanatory

    Foundations of the PARAFAC procedure: Models and conditions for an" explanatory" multimodal factor analysis , author=

  22. [22]

    Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =

    De-Arteaga, Maria and Romanov, Alexey and Wallach, Hanna and Chayes, Jennifer and Borgs, Christian and Chouldechova, Alexandra and Geyik, Sahin and Kenthapadi, Krishnaram and Kalai, Adam Tauman , title =. Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =. 2019 , isbn =. doi:10.1145/3287560.3287572 , abstract =

  23. [23]

    R eddit B ias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models

    Barikeri, Soumya and Lauscher, Anne and Vuli \'c , Ivan and Glava s , Goran. R eddit B ias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: ...

  24. [24]

    Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies , publisher =

    Fan, Angela and Gardent, Claire , keywords =. Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2204.05879 , url =

  25. [25]

    Multilingual T witter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition

    Huang, Xiaolei and Xing, Linzi and Dernoncourt, Franck and Paul, Michael J. Multilingual T witter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

  26. [26]

    2018 , booktitle=

    Amazon scraps secret AI recruiting tool that showed bias against women , author=. 2018 , booktitle=

  27. [27]

    , author=

    Visualizing data using t-SNE. , author=. Journal of machine learning research , volume=

  28. [28]

    Probing Classifiers are Unreliable for Concept Removal and Detection , publisher =

    Kumar, Abhinav and Tan, Chenhao and Sharma, Amit , keywords =. Probing Classifiers are Unreliable for Concept Removal and Detection , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2207.04153 , url =

  29. [29]

    Language and linguistics compass , volume=

    Construction morphology , author=. Language and linguistics compass , volume=. 2010 , publisher=

  30. [30]

    2023 , eprint=

    Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations , author=. 2023 , eprint=

  31. [31]

    2022 , eprint=

    MABEL: Attenuating Gender Bias using Textual Entailment Data , author=. 2022 , eprint=

  32. [32]

    2023 , eprint=

    Training Socially Aligned Language Models in Simulated Human Society , author=. 2023 , eprint=

  33. [33]

    Language and Linguistics Compass , volume =

    Hovy, Dirk and Prabhumoye, Shrimai , title =. Language and Linguistics Compass , volume =. doi:https://doi.org/10.1111/lnc3.12432 , url =. https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12432 , abstract =

  34. [34]

    You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings

    Talat, Zeerak and N \'e v \'e ol, Aur \'e lie and Biderman, Stella and Clinciu, Miruna and Dey, Manan and Longpre, Shayne and Luccioni, Sasha and Masoud, Maraim and Mitchell, Margaret and Radev, Dragomir and Sharma, Shanya and Subramonian, Arjun and Tae, Jaesung and Tan, Samson and Tunuguntla, Deepak and Van Der Wal, Oskar. You reap what you sow: On the C...

  35. [35]

    arXiv preprint arXiv:2404.01349 , year=

    Fairness in Large Language Models: A Taxonomic Survey , author=. arXiv preprint arXiv:2404.01349 , year=

  36. [36]

    Storkey , editor =

    Harrison Edwards and Amos J. Storkey , editor =. Censoring Representations with an Adversary , booktitle =. 2016 , url =

  37. [37]

    Encoding Prior Knowledge with Eigenword Embeddings

    Osborne, Dominique and Narayan, Shashi and Cohen, Shay B. Encoding Prior Knowledge with Eigenword Embeddings. Transactions of the Association for Computational Linguistics. 2016. doi:10.1162/tacl_a_00108

  38. [38]

    A Joint Matrix Factorization Analysis of Multilingual Representations

    Zhao, Zheng and Ziser, Yftah and Webber, Bonnie and Cohen, Shay. A Joint Matrix Factorization Analysis of Multilingual Representations. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.851

  39. [39]

    Toward Gender-Inclusive Coreference Resolution

    Cao, Yang Trista and Daum \'e III, Hal. Toward Gender-Inclusive Coreference Resolution. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.418

  40. [40]

    Sparse Activation Editing for Reliable Instruction Following in Narratives

    Zhao, Runcong and Cao, Chengyu and Zhu, Qinglin and Ly, Xiucheng and Shao, Shun and Gui, Lin and Xu, Ruifeng and He, Yulan. Sparse Activation Editing for Reliable Instruction Following in Narratives. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1311

  41. [41]

    Contemporary Mathematics , volume=

    Projectors on intersection of subspaces , author=. Contemporary Mathematics , volume=. 2015 , publisher=

  42. [42]

    FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models , url =

    Fu, Zihao and Brown, Ryan and Shao, Shun and Rawal, Kai and Delaney, Eoin and Russell, Chris , booktitle =. FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models , url =

  43. [43]

    Demographic Dialectal Variation in Social Media: A Case Study of A frican- A merican E nglish

    Blodgett, Su Lin and Green, Lisa and O ' Connor, Brendan. Demographic Dialectal Variation in Social Media: A Case Study of A frican- A merican E nglish. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1120

  44. [44]

    proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

    Bias in bios: A case study of semantic representation bias in a high-stakes setting , author=. proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

  45. [45]

    2024 , eprint=

    Beyond Voice Assistants: Exploring Advantages and Risks of an In-Car Social Robot in Real Driving Scenarios , author=. 2024 , eprint=

  46. [46]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  47. [47]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  48. [48]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  49. [49]

    2023 , eprint=

    Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models , author=. 2023 , eprint=

  50. [50]

    LEACE: Perfect linear concept erasure in closed form , url =

    Belrose, Nora and Schneider-Joseph, David and Ravfogel, Shauli and Cotterell, Ryan and Raff, Edward and Biderman, Stella , booktitle =. LEACE: Perfect linear concept erasure in closed form , url =

  51. [51]

    Proceedings of the 39th International Conference on Machine Learning , pages =

    Linear Adversarial Concept Erasure , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

  52. [52]

    2023 , eprint=

    Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=

  53. [53]

    International Conference on Learning Representations , year=

    All-but-the-Top: Simple and Effective Postprocessing for Word Representations , author=. International Conference on Learning Representations , year=

  54. [54]

    Conceptor Debiasing of Word Representations Evaluated on WEAT

    Karve, Saket and Ungar, Lyle and Sedoc, Jo \ a o. Conceptor Debiasing of Word Representations Evaluated on WEAT. Proceedings of the First Workshop on Gender Bias in Natural Language Processing. 2019. doi:10.18653/v1/W19-3806

  55. [55]

    2023 , eprint=

    Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla , author=. 2023 , eprint=

  56. [56]

    Gold Doesn ' t Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information

    Shao, Shun and Ziser, Yftah and Cohen, Shay B. Gold Doesn ' t Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.118

  57. [57]

    Lipstick on a Pig: D ebiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them

    Gonen, Hila and Goldberg, Yoav. Lipstick on a Pig: D ebiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1061

  58. [58]

    Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

    Ravfogel, Shauli and Elazar, Yanai and Gonen, Hila and Twiton, Michael and Goldberg, Yoav. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.647

  59. [59]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    On measuring and mitigating biased inferences of word embeddings , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  60. [60]

    Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

    Mondorf, Philipp and Wold, Sondre and Plank, Barbara. Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.727

  61. [61]

    SIAM Journal on Matrix Analysis and Applications , volume =

    De Lathauwer, Lieven and De Moor, Bart and Vandewalle, Joos , title =. SIAM Journal on Matrix Analysis and Applications , volume =. 2000 , doi =. https://doi.org/10.1137/S0895479896305696 , abstract =

  62. [62]

    arXiv preprint arXiv:2210.12553 , year=

    Understanding domain learning in language models through subpopulation analysis , author=. arXiv preprint arXiv:2210.12553 , year=

  63. [63]

    First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 , year=

    Self-destructing models: Increasing the costs of harmful dual uses in foundation models , author=. First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 , year=

  64. [64]

    Transactions of the Association for Computational Linguistics , volume =

    Schick, Timo and Udupa, Sahana and Schütze, Hinrich , title = ". Transactions of the Association for Computational Linguistics , volume =. 2021 , month =. doi:10.1162/tacl_a_00434 , url =

  65. [65]

    Adversarial Removal of Demographic Attributes from Text Data

    Elazar, Yanai and Goldberg, Yoav. Adversarial Removal of Demographic Attributes from Text Data. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1002

  66. [66]

    2018 , isbn =

    Zhang, Brian Hu and Lemoine, Blake and Mitchell, Margaret , title =. 2018 , isbn =. doi:10.1145/3278721.3278779 , booktitle =

  67. [67]

    Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =

    Xie, Qizhe and Dai, Zihang and Du, Yulun and Hovy, Eduard and Neubig, Graham , title =. Proceedings of the 31st International Conference on Neural Information Processing Systems , pages =. 2017 , isbn =

  68. [68]

    CoRR , volume =

    Kellie Webster and Xuezhi Wang and Ian Tenney and Alex Beutel and Emily Pitler and Ellie Pavlick and Jilin Chen and Slav Petrov , title =. CoRR , volume =. 2020 , url =. 2010.06032 , timestamp =

  69. [69]

    An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models

    Meade, Nicholas and Poole-Dayan, Elinor and Reddy, Siva. An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.132

  70. [70]

    arXiv preprint arXiv:2306.05949 , year=

    Evaluating the Social Impact of Generative AI Systems in Systems and Society , author=. arXiv preprint arXiv:2306.05949 , year=

  71. [71]

    arXiv preprint arXiv:2108.07258 , year=

    On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

  72. [72]

    arXiv preprint arXiv:2303.15715 , year=

    Foundation models and fair use , author=. arXiv preprint arXiv:2303.15715 , year=

  73. [73]

    2023 , eprint=

    Recent Advances towards Safe, Responsible, and Moral Dialogue Systems: A Survey , author=. 2023 , eprint=

  74. [74]

    2023 , eprint=

    Sociotechnical Safety Evaluation of Generative AI Systems , author=. 2023 , eprint=

  75. [75]

    arXiv preprint arXiv:2112.04359 , year=

    Ethical and social risks of harm from language models , author=. arXiv preprint arXiv:2112.04359 , year=

  76. [76]

    Stevie and Hendricks, Lisa Anne and Rauh, Maribeth and Wu, Boxi and Agnew, William and Kunesch, Markus and Duan, Isabella and Gabriel, Iason and Isaac, William , title =

    Bergman, A. Stevie and Hendricks, Lisa Anne and Rauh, Maribeth and Wu, Boxi and Agnew, William and Kunesch, Markus and Duan, Isabella and Gabriel, Iason and Isaac, William , title =. 2023 , isbn =. doi:10.1145/3593013.3594019 , booktitle =

  77. [77]

    2020 , eprint=

    Language (Technology) is Power: A Critical Survey of "Bias" in NLP , author=. 2020 , eprint=

  78. [78]

    Fairness in Language Models Beyond E nglish: Gaps and Challenges

    Ramesh, Krithika and Sitaram, Sunayana and Choudhury, Monojit. Fairness in Language Models Beyond E nglish: Gaps and Challenges. Findings of the Association for Computational Linguistics: EACL 2023. 2023. doi:10.18653/v1/2023.findings-eacl.157

  79. [79]

    1996 , issue_date =

    Friedman, Batya and Nissenbaum, Helen , title =. 1996 , issue_date =. doi:10.1145/230538.230561 , month =

  80. [80]

    Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies

    Dev, Sunipa and Monajatipoor, Masoud and Ovalle, Anaelia and Subramonian, Arjun and Phillips, Jeff and Chang, Kai-Wei. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.150

Showing first 80 references.