Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

Anna Korhonen; Shay B. Cohen; Shun Shao; Yftah Ziser; Zheng Zhao

arxiv: 2606.12088 · v1 · pith:HW6KS7EQnew · submitted 2026-06-10 · 💻 cs.CL

Shun Shao, Zheng Zhao, Anna Korhonen, Yftah Ziser, Shay B. Cohen

Pith reviewed 2026-06-27 09:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords fairness in NLP · debiasing without protected attributes · concept erasure · self-description text · language models · helpfulness prediction · implicit signals · Stack Exchange benchmark

The pith

Language models can be debiased using self-description text in place of explicit protected attributes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that fairness interventions in NLP can proceed without direct access to sensitive attributes such as gender or race. It introduces H-SAL, which erases biased concepts post hoc by using self-description text as the debiasing signal. A new Stack Exchange benchmark for helpfulness prediction supplies both explicit and implicit signals to enable direct comparison. Experiments across encoder and decoder-only models show that the implicit route often matches or exceeds the explicit one. This matters because many real deployments lack the protected labels that standard debiasing methods require.

Core claim

Post-hoc concept and attribute erasure can be performed using self-description text as an implicit debiasing signal, and across encoder and decoder-only language models this implicit approach often matches or outperforms explicit-label-based debiasing on a new multi-domain fairness benchmark for helpfulness prediction.

What carries the argument

H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal
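
A minimal sketch of what such post-hoc erasure could look like, assuming a SAL-style spectral recipe: take the cross-covariance between the task representations and the debiasing signal, then project out its top-k left singular directions. The page does not spell out H-SAL's internals, so the function name `erase_top_directions`, the use of a precomputed matrix for the encoded self-descriptions, and the centering choices are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def erase_top_directions(X, S, k):
    """Project out of X the k directions that co-vary most with S.

    X: (n, d) main-task representations.
    S: (n, m) debiasing signal. Explicit setting: one-hot protected labels.
       Implicit setting (assumed): encoded self-description vectors.
    Returns the cleaned representations and the removed directions.
    """
    Xc = X - X.mean(axis=0)
    Sc = S - S.mean(axis=0)
    cross_cov = Xc.T @ Sc / X.shape[0]            # shape (d, m)
    U, _, _ = np.linalg.svd(cross_cov, full_matrices=False)
    U_k = U[:, :k]                                # top-k left singular vectors
    P = np.eye(X.shape[1]) - U_k @ U_k.T          # orthogonal projector
    return X @ P, U_k
```

Swapping one-hot labels for encoded self-descriptions in `S` is the only difference between the explicit and implicit variants in this sketch; `k` plays the role of the number of removed directions.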

If this is right

Debiasing remains feasible in privacy-constrained or metadata-absent settings.
Task performance on downstream prediction does not degrade when switching to implicit signals.
Both encoder-only and decoder-only models respond comparably to the implicit erasure method.
A multi-domain benchmark now exists for testing debiasing under realistic data constraints.
Representation-level fairness methods can be applied without collecting protected attributes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be tested on other prediction tasks such as toxicity detection or resume screening where self-descriptions are available.
If self-descriptions prove stable over time, models could be periodically re-debiased without re-collecting labels.
The approach raises the question of whether implicit signals themselves encode secondary biases that explicit methods avoid.
Deployment pipelines could default to implicit erasure when explicit attributes are legally unavailable.

Load-bearing premise

Self-description text contains reliable implicit signals that can substitute for direct protected attributes in concept erasure without introducing new biases or losing task performance.

What would settle it

An experiment showing that H-SAL with self-descriptions consistently yields worse fairness (for example, a larger TPR-Gap) or worse helpfulness-prediction accuracy than explicit-label debiasing on the Stack Exchange benchmark would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.12088 by Anna Korhonen, Shay B. Cohen, Shun Shao, Yftah Ziser, Zheng Zhao.

**Figure 1.** Self-description cues used for textual erasure. Real examples from Stack Exchange user profiles, where H denotes the user’s self-description. The highlighted spans mark words or phrases that may implicitly reveal attributes such as education, reputation or location, and are therefore used to learn directions for textual erasure.

**Figure 2.** A Venn diagram showing the information content overlap between the different random variables. The success of H-SAL depends on how much the part of Z that overlaps with X also overlaps with H. Here X is the representation of the main-task input, Y is the task label, Z is the guarded attribute, drawn from a set of possible attribute values, and H is the raw self-description text, which is encoded with a text encoder ϕ.

**Figure 4.** Plots of the bound for $\rho_1$ where $u_3^\top v_2 = 0.5$ (left) and $u_3^\top v_2 = 1$ (right). Note that the bound can be symmetrically flipped to have $\rho_2$ on the left, dependent on $\rho_1$.

**Figure 5.** Engagement bias reduction across three Stack Exchange communities under the natural-disjoint split. Each subfigure plots TPR-Gap against the number of removed directions (k) for BERT, Mistral-7B, and Llama-3.1-8B, comparing random, explicit, and implicit SAL. TPR-Gap is the generalized disparity (2× StdDev) of true positive rates across groups, equivalent to the difference in true positive rates in the bin…
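
The TPR-Gap described in the Figure 5 caption can be sketched as follows. This assumes the population standard deviation (NumPy's default, `ddof=0`), which is the choice under which 2× StdDev over two groups equals the absolute difference of their true positive rates. The function name `tpr_gap` and the decision to skip groups with no positive examples are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tpr_gap(y_true, y_pred, groups):
    """2 x population std of per-group true positive rates."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    tprs = []
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == 1)
        if positives.any():                       # skip groups with no positives
            tprs.append((y_pred[positives] == 1).mean())
    return 2.0 * np.std(tprs)                     # ddof=0 by default
```

With two groups whose true positive rates are 1.0 and 0.5, the standard deviation is 0.25 and the function returns 0.5, the plain difference between the two rates.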

Original abstract

Most fairness research in NLP assumes direct access to protected attributes such as gender, race, or nationality. In practice, however, such information is often unavailable due to privacy constraints, missing metadata, or legal restrictions, even though models may infer it from indirect textual cues. This raises a key question: can debiasing succeed without direct access to sensitive attributes? We propose H-SAL, which performs post-hoc concept and attribute erasure using self-description text as an implicit debiasing signal. To support this setting, we introduce a multi-domain Stack Exchange-based fairness benchmark for helpfulness prediction that includes both explicit and implicit signals, enabling comparison between standard debiasing with protected labels and debiasing without access to sensitive information. Across encoder and decoder-only language models, we find that implicit self-description often matches or outperforms explicit-label-based debiasing. Our results broaden representation-level fairness research and provide a new benchmark for studying debiasing under realistic data constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

H-SAL uses self-description text for debiasing without protected attributes and reports competitive results on a new Stack Exchange benchmark, but the evidence for reliable implicit signals is thin without detailed experiments.

The main point is that this paper tackles debiasing when protected attributes cannot be accessed directly. It introduces H-SAL, which erases concepts post hoc using self-description text as an implicit signal, and builds a multi-domain Stack Exchange benchmark for helpfulness prediction that supports both explicit-label and implicit comparisons.

The work does a solid job naming a real constraint in fairness research (privacy rules and missing metadata often block standard methods) and sets up a direct head-to-head test across encoder and decoder models. The claim that implicit self-descriptions often match or beat explicit debiasing is the central result, and the benchmark construction itself looks like a practical addition for studying this setting.

The soft spots sit around whether the self-descriptions actually supply strong enough signals. If those texts are sparse, domain-specific, or only weakly tied to the attributes in some subsets, then H-SAL could be operating on a different or weaker concept than the explicit baseline, which would make the performance parity look better than it generalizes. The abstract supplies no ablations, error bars, or derivation details for H-SAL, so it is difficult to judge robustness or rule out benchmark artifacts. The stress-test note on noisy or weakly aligned signals is worth checking against the full experiments.

This is for people working on representation-level fairness who routinely hit data-access limits. A reader who needs a benchmark or an alternative to label-dependent methods could extract value from the setup even if the results require more verification.

I would send it to peer review. The problem is concrete and the approach is distinct enough to merit referee time, though the paper will need to show stronger evidence on signal quality and experimental controls.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes H-SAL, a post-hoc method for latent concept and attribute erasure that uses self-description text as an implicit debiasing signal in place of direct protected attributes (gender, race, etc.). To enable evaluation, the authors introduce a multi-domain Stack Exchange benchmark for helpfulness prediction containing both explicit labels and implicit self-descriptions. Empirical results across encoder and decoder-only LMs indicate that implicit self-description often matches or exceeds explicit-label debiasing on fairness and task metrics.

Significance. If the central empirical claim holds after verification of signal reliability, the work would meaningfully extend representation-level fairness methods to privacy-constrained regimes where protected attributes are unavailable. The new benchmark is a concrete contribution that supports controlled comparison of explicit vs. implicit debiasing. No machine-checked proofs or parameter-free derivations are present; the contribution is empirical.

major comments (2)

[Benchmark and Experiments] The central claim that implicit self-description supplies signals 'sufficient' for concept erasure to match explicit-label performance (§ on experiments / results) is load-bearing, yet the manuscript provides no reported correlation coefficients, mutual information, or ablation quantifying how strongly self-descriptions align with protected attributes in the Stack Exchange data. Without this, observed parity could be an artifact of benchmark construction rather than evidence that H-SAL erases the intended concept.
[Results] Table or figure reporting helpfulness-prediction results: the claim of 'often matches or outperforms' lacks error bars, statistical significance tests, or per-domain breakdowns. If variance is high or gains are driven by a subset of domains, the cross-model generalization statement is not yet supported.
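
One way the alignment check requested in the first major comment could be approximated is a plug-in mutual-information estimate between a discretized self-description signal and the protected attribute. The sketch below assumes both are available as discrete label arrays; how self-descriptions would be discretized (for example, cluster ids over encoder outputs) is a hypothetical choice and not something the manuscript specifies.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in estimate of I(A; B) in nats for two discrete label arrays."""
    a_idx = np.unique(np.asarray(a), return_inverse=True)[1].ravel()
    b_idx = np.unique(np.asarray(b), return_inverse=True)[1].ravel()
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)         # contingency counts
    joint /= joint.sum()                          # joint distribution
    pa = joint.sum(axis=1, keepdims=True)         # marginal of A, shape (|A|, 1)
    pb = joint.sum(axis=0, keepdims=True)         # marginal of B, shape (1, |B|)
    nz = joint > 0                                # avoid log(0)
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())
```

A value near zero would suggest the self-description signal carries little information about the attribute, which is the scenario the referee worries about. The plug-in estimator is biased upward on small samples, so it is best read comparatively.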

minor comments (2)

[Method] Notation for H-SAL components (e.g., the erasure objective) should be defined once with a single equation block rather than scattered prose references.
[Experimental Setup] The abstract states results 'across encoder and decoder-only language models' but the manuscript should explicitly list the exact model sizes and fine-tuning regimes used for each.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline revisions to strengthen the empirical support for our claims regarding implicit debiasing signals.

Referee: [Benchmark and Experiments] The central claim that implicit self-description supplies signals 'sufficient' for concept erasure to match explicit-label performance (§ on experiments / results) is load-bearing, yet the manuscript provides no reported correlation coefficients, mutual information, or ablation quantifying how strongly self-descriptions align with protected attributes in the Stack Exchange data. Without this, observed parity could be an artifact of benchmark construction rather than evidence that H-SAL erases the intended concept.

Authors: We agree that quantifying the alignment between self-description signals and protected attributes would provide stronger evidence against the possibility of benchmark artifacts. The benchmark was constructed to include both explicit labels (for evaluation) and implicit self-descriptions (for debiasing), but we did not report alignment metrics such as mutual information or correlations in the original submission. In the revised manuscript we will add these analyses, including correlation coefficients and an ablation that measures how much of the debiasing effect is attributable to the self-description signal versus other factors. This will directly address whether the observed performance parity reflects genuine concept erasure. revision: yes
Referee: [Results] Table or figure reporting helpfulness-prediction results: the claim of 'often matches or outperforms' lacks error bars, statistical significance tests, or per-domain breakdowns. If variance is high or gains are driven by a subset of domains, the cross-model generalization statement is not yet supported.

Authors: We acknowledge that the absence of error bars, significance testing, and per-domain results limits the strength of the generalization claims. The original tables reported aggregate metrics across models and domains but did not include these details. In the revision we will augment the results section with standard error bars, paired statistical significance tests (e.g., McNemar or Wilcoxon where appropriate), and full per-domain breakdowns. These additions will allow readers to assess whether performance parity holds consistently or is driven by particular domains, thereby supporting or qualifying the cross-model statements. revision: yes
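
As an illustration of the paired testing mentioned in the second response, the sketch below implements an exact two-sided McNemar test on per-example correctness of two systems (for instance, explicit versus implicit debiasing evaluated on the same test items). The function name `mcnemar_exact` and the input format are assumptions for illustration; the rebuttal does not say which variant of the test would be used.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar p-value from paired correctness flags."""
    # Discordant pairs: only A correct, and only B correct.
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0                                # no disagreement at all
    # Binomial(n, 0.5) tail probability of the smaller discordant count.
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Only the discordant pairs enter the statistic, so two systems with the same accuracy can still differ significantly if they err on different examples.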

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on new benchmark

The paper introduces H-SAL as a post-hoc erasure method and a Stack Exchange benchmark, then reports experimental results showing implicit self-description often matches or exceeds explicit-label debiasing across models. No equations, fitted parameters renamed as predictions, self-citation load-bearing steps, or ansatzes are present in the provided text. The central claim rests on direct empirical measurement rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; the approach rests on the domain assumption that self-descriptions encode usable proxies for protected attributes and that post-hoc erasure can isolate those concepts without task degradation.

axioms (1)

domain assumption: Self-description text contains implicit signals for protected attributes that can be used for debiasing.
Central to H-SAL and the benchmark comparison

Reference graph

Works this paper leans on

148 extracted references · 77 canonical work pages

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[8]

Journal of Artificial Intelligence Research , volume=

A survey of cross-lingual word embedding models , author=. Journal of Artificial Intelligence Research , volume=
[9]

Adversarial Concept Erasure in Kernel Space

Ravfogel, Shauli and Vargas, Francisco and Goldberg, Yoav and Cotterell, Ryan. Adversarial Concept Erasure in Kernel Space. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.405

work page doi:10.18653/v1/2022.emnlp-main.405 2022
[10]

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , url =

Bolukbasi, Tolga and Chang, Kai-Wei and Zou, James Y and Saligrama, Venkatesh and Kalai, Adam T , booktitle =. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , url =
[11]

Bryson and Arvind Narayanan , title =

Aylin Caliskan and Joanna J. Bryson and Arvind Narayanan , title =. Science , volume =. 2017 , doi =

2017
[12]

2021 , isbn =

Guo, Wei and Caliskan, Aylin , title =. 2021 , isbn =. doi:10.1145/3461702.3462536 , booktitle =

work page doi:10.1145/3461702.3462536 2021
[13]

Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings

Manzini, Thomas and Yao Chong, Lim and Black, Alan W and Tsvetkov, Yulia. Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers...

work page doi:10.18653/v1/n19-1062 2019
[14]

Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment

Veeman, Hartger and Allassonni \`e re-Tang, Marc and Berdicevskis, Aleksandrs and Basirat, Ali. Cross-lingual Embeddings Reveal Universal and Lineage-Specific Patterns in Grammatical Gender Assignment. Proceedings of the 24th Conference on Computational Natural Language Learning. 2020. doi:10.18653/v1/2020.conll-1.20

work page doi:10.18653/v1/2020.conll-1.20 2020
[15]

Analytical Methods for Interpretable Ultradense Word Embeddings

Dufter, Philipp and Sch. Analytical Methods for Interpretable Ultradense Word Embeddings. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1111

work page doi:10.18653/v1/d19-1111 2019
[16]

Unsupervised Cross-lingual Representation Learning at Scale

Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm \'a n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ...

work page doi:10.18653/v1/2020.acl-main.747 2020
[17]

Psychometrika , volume=

Principal component analysis of three-mode data by means of alternating least squares algorithms , author=. Psychometrika , volume=. 1980 , publisher=

1980
[18]

and Bader, Brett W

Kolda, Tamara G. and Bader, Brett W. , title =. SIAM Review , volume =. 2009 , doi =

2009
[19]

Dumais, S. T. and Furnas, G. W. and Landauer, T. K. and Deerwester, S. and Harshman, R. , title =. 1988 , isbn =. doi:10.1145/57167.57214 , booktitle =

work page doi:10.1145/57167.57214 1988
[20]

Eckart-Young

Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition , author=. Psychometrika , volume=. 1970 , publisher=

1970
[21]

explanatory

Foundations of the PARAFAC procedure: Models and conditions for an" explanatory" multimodal factor analysis , author=
[22]

Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =

De-Arteaga, Maria and Romanov, Alexey and Wallach, Hanna and Chayes, Jennifer and Borgs, Christian and Chouldechova, Alexandra and Geyik, Sahin and Kenthapadi, Krishnaram and Kalai, Adam Tauman , title =. Proceedings of the Conference on Fairness, Accountability, and Transparency , pages =. 2019 , isbn =. doi:10.1145/3287560.3287572 , abstract =

work page doi:10.1145/3287560.3287572 2019
[23]

R eddit B ias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models

Barikeri, Soumya and Lauscher, Anne and Vuli \'c , Ivan and Glava s , Goran. R eddit B ias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: ...

work page doi:10.18653/v1/2021.acl-long.151 2021
[24]

Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies , publisher =

Fan, Angela and Gardent, Claire , keywords =. Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2204.05879 , url =

work page doi:10.48550/arxiv.2204.05879 2022
[25]

Multilingual T witter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition

Huang, Xiaolei and Xing, Linzi and Dernoncourt, Franck and Paul, Michael J. Multilingual T witter Corpus and Baselines for Evaluating Demographic Bias in Hate Speech Recognition. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

2020
[26]

2018 , booktitle=

Amazon scraps secret AI recruiting tool that showed bias against women , author=. 2018 , booktitle=

2018
[27]

, author=

Visualizing data using t-SNE. , author=. Journal of machine learning research , volume=
[28]

Probing Classifiers are Unreliable for Concept Removal and Detection , publisher =

Kumar, Abhinav and Tan, Chenhao and Sharma, Amit , keywords =. Probing Classifiers are Unreliable for Concept Removal and Detection , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2207.04153 , url =

work page doi:10.48550/arxiv.2207.04153 2022
[29]

Language and linguistics compass , volume=

Construction morphology , author=. Language and linguistics compass , volume=. 2010 , publisher=

2010
[30]

2023 , eprint=

Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations , author=. 2023 , eprint=

2023
[31]

2022 , eprint=

MABEL: Attenuating Gender Bias using Textual Entailment Data , author=. 2022 , eprint=

2022
[32]

2023 , eprint=

Training Socially Aligned Language Models in Simulated Human Society , author=. 2023 , eprint=

2023
[33]

Language and Linguistics Compass , volume =

Hovy, Dirk and Prabhumoye, Shrimai , title =. Language and Linguistics Compass , volume =. doi:https://doi.org/10.1111/lnc3.12432 , url =. https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12432 , abstract =

work page doi:10.1111/lnc3.12432
[34]

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings

Talat, Zeerak and N \'e v \'e ol, Aur \'e lie and Biderman, Stella and Clinciu, Miruna and Dey, Manan and Longpre, Shayne and Luccioni, Sasha and Masoud, Maraim and Mitchell, Margaret and Radev, Dragomir and Sharma, Shanya and Subramonian, Arjun and Tae, Jaesung and Tan, Samson and Tunuguntla, Deepak and Van Der Wal, Oskar. You reap what you sow: On the C...

work page doi:10.18653/v1/2022.bigscience-1.3 2022
[35]

arXiv preprint arXiv:2404.01349 , year=

Fairness in Large Language Models: A Taxonomic Survey , author=. arXiv preprint arXiv:2404.01349 , year=

arXiv
[36]

Storkey , editor =

Harrison Edwards and Amos J. Storkey , editor =. Censoring Representations with an Adversary , booktitle =. 2016 , url =

2016
[37]

Encoding Prior Knowledge with Eigenword Embeddings

Osborne, Dominique and Narayan, Shashi and Cohen, Shay B. Encoding Prior Knowledge with Eigenword Embeddings. Transactions of the Association for Computational Linguistics. 2016. doi:10.1162/tacl_a_00108

work page doi:10.1162/tacl_a_00108 2016
[38]

A Joint Matrix Factorization Analysis of Multilingual Representations

Zhao, Zheng and Ziser, Yftah and Webber, Bonnie and Cohen, Shay. A Joint Matrix Factorization Analysis of Multilingual Representations. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.851

work page doi:10.18653/v1/2023.findings-emnlp.851 2023
[39]

Toward Gender-Inclusive Coreference Resolution

Cao, Yang Trista and Daum \'e III, Hal. Toward Gender-Inclusive Coreference Resolution. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.418

work page doi:10.18653/v1/2020.acl-main.418 2020
[40]

Sparse Activation Editing for Reliable Instruction Following in Narratives

Zhao, Runcong and Cao, Chengyu and Zhu, Qinglin and Ly, Xiucheng and Shao, Shun and Gui, Lin and Xu, Ruifeng and He, Yulan. Sparse Activation Editing for Reliable Instruction Following in Narratives. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1311

work page doi:10.18653/v1/2025.emnlp-main.1311 2025
[41]

Contemporary Mathematics , volume=

Projectors on intersection of subspaces , author=. Contemporary Mathematics , volume=. 2015 , publisher=

2015
[42]

FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models , url =

Fu, Zihao and Brown, Ryan and Shao, Shun and Rawal, Kai and Delaney, Eoin and Russell, Chris , booktitle =. FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models , url =
[43]

Demographic Dialectal Variation in Social Media: A Case Study of A frican- A merican E nglish

Blodgett, Su Lin and Green, Lisa and O ' Connor, Brendan. Demographic Dialectal Variation in Social Media: A Case Study of A frican- A merican E nglish. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1120

work page doi:10.18653/v1/d16-1120 2016
[44]

proceedings of the Conference on Fairness, Accountability, and Transparency , pages=

Bias in bios: A case study of semantic representation bias in a high-stakes setting , author=. proceedings of the Conference on Fairness, Accountability, and Transparency , pages=
[45]

2024 , eprint=

Beyond Voice Assistants: Exploring Advantages and Risks of an In-Car Social Robot in Real Driving Scenarios , author=. 2024 , eprint=

2024
[46]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
[47]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
[48]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

2016
[49]

2023 , eprint=

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models , author=. 2023 , eprint=

2023
[50]

LEACE: Perfect linear concept erasure in closed form , url =

Belrose, Nora and Schneider-Joseph, David and Ravfogel, Shauli and Cotterell, Ryan and Raff, Edward and Biderman, Stella , booktitle =. LEACE: Perfect linear concept erasure in closed form , url =
[51]

Proceedings of the 39th International Conference on Machine Learning , pages =

Linear Adversarial Concept Erasure , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

2022
[52]

2023 , eprint=

Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=

2023
[53]

International Conference on Learning Representations , year=

All-but-the-Top: Simple and Effective Postprocessing for Word Representations , author=. International Conference on Learning Representations , year=
[54]

Conceptor Debiasing of Word Representations Evaluated on WEAT

Karve, Saket and Ungar, Lyle and Sedoc, Jo \ a o. Conceptor Debiasing of Word Representations Evaluated on WEAT. Proceedings of the First Workshop on Gender Bias in Natural Language Processing. 2019. doi:10.18653/v1/W19-3806

work page doi:10.18653/v1/w19-3806 2019
[55]

2023 , eprint=

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla , author=. 2023 , eprint=

2023
[56]

Gold Doesn ' t Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information

Shao, Shun and Ziser, Yftah and Cohen, Shay B. Gold Doesn ' t Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.118

work page doi:10.18653/v1/2023.eacl-main.118 2023
[57]

Lipstick on a Pig: D ebiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them

Gonen, Hila and Goldberg, Yoav. Lipstick on a Pig: D ebiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1061

work page doi:10.18653/v1/n19-1061 2019
[58]

Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

Ravfogel, Shauli and Elazar, Yanai and Gonen, Hila and Twiton, Michael and Goldberg, Yoav. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.647

work page doi:10.18653/v1/2020.acl-main.647 2020
[59]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

On measuring and mitigating biased inferences of word embeddings , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[60]

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models

Mondorf, Philipp and Wold, Sondre and Plank, Barbara. Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.727

work page doi:10.18653/v1/2025.acl-long.727 2025
[61]

SIAM Journal on Matrix Analysis and Applications , volume =

De Lathauwer, Lieven and De Moor, Bart and Vandewalle, Joos , title =. SIAM Journal on Matrix Analysis and Applications , volume =. 2000 , doi =. https://doi.org/10.1137/S0895479896305696 , abstract =

work page doi:10.1137/s0895479896305696 2000
[62]

arXiv preprint arXiv:2210.12553 , year=

Understanding domain learning in language models through subpopulation analysis , author=. arXiv preprint arXiv:2210.12553 , year=

arXiv
[63]

First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 , year=

Self-destructing models: Increasing the costs of harmful dual uses in foundation models , author=. First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 , year=

2022
[64]

Transactions of the Association for Computational Linguistics , volume =

Schick, Timo and Udupa, Sahana and Schütze, Hinrich , title = ". Transactions of the Association for Computational Linguistics , volume =. 2021 , month =. doi:10.1162/tacl_a_00434 , url =

work page doi:10.1162/tacl_a_00434 2021
[65]

Adversarial Removal of Demographic Attributes from Text Data

Elazar, Yanai and Goldberg, Yoav. Adversarial Removal of Demographic Attributes from Text Data. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1002

work page doi:10.18653/v1/d18-1002 2018
[66]

Zhang, Brian Hu and Lemoine, Blake and Mitchell, Margaret. 2018. doi:10.1145/3278721.3278779
[67]

Xie, Qizhe and Dai, Zihang and Du, Yulun and Hovy, Eduard and Neubig, Graham. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017.
[68]

Kellie Webster and Xuezhi Wang and Ian Tenney and Alex Beutel and Emily Pitler and Ellie Pavlick and Jilin Chen and Slav Petrov. CoRR. 2020. arXiv:2010.06032
[69]

Meade, Nicholas and Poole-Dayan, Elinor and Reddy, Siva. An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.132
[70]

Evaluating the Social Impact of Generative AI Systems in Systems and Society. arXiv preprint arXiv:2306.05949.
[71]

On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[72]

Foundation models and fair use. arXiv preprint arXiv:2303.15715.
[73]

Recent Advances towards Safe, Responsible, and Moral Dialogue Systems: A Survey. 2023.
[74]

Sociotechnical Safety Evaluation of Generative AI Systems. 2023.
[75]

Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
[76]

Bergman, A. Stevie and Hendricks, Lisa Anne and Rauh, Maribeth and Wu, Boxi and Agnew, William and Kunesch, Markus and Duan, Isabella and Gabriel, Iason and Isaac, William. 2023. doi:10.1145/3593013.3594019
[77]

Language (Technology) is Power: A Critical Survey of "Bias" in NLP. 2020.
[78]

Ramesh, Krithika and Sitaram, Sunayana and Choudhury, Monojit. Fairness in Language Models Beyond English: Gaps and Challenges. Findings of the Association for Computational Linguistics: EACL 2023. 2023. doi:10.18653/v1/2023.findings-eacl.157
[79]

Friedman, Batya and Nissenbaum, Helen. 1996. doi:10.1145/230538.230561
[80]

Dev, Sunipa and Monajatipoor, Masoud and Ovalle, Anaelia and Subramonian, Arjun and Phillips, Jeff and Chang, Kai-Wei. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.150
