pith. machine review for the scientific record.

arxiv: 2605.09647 · v1 · submitted 2026-05-10 · 💻 cs.SI

Recognition: 2 theorem links · Lean Theorem

Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:28 UTC · model grok-4.3

classification 💻 cs.SI
keywords LLMs · stereotypical bias · self-debiasing · conflict monitoring · neuron ablation · model editing · fairness
0 comments

The pith

Deactivating COCO neurons in LLMs causes over 90 percent of outputs to revert to biased content, exposing an internal self-correction process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models develop internal ways to reduce stereotypical outputs during generation, separate from prompt-based safety rules. It introduces COCO as a way to locate neurons that stay consistent within one type of response but differ sharply between stereotypical and unbiased ones. Turning these neurons off produces far more bias than direct adversarial prompts do. The work also shows lightweight editing methods that strengthen this internal correction while keeping the model's normal abilities intact.

Core claim

We propose COCO, a contrastive causal method to identify neurons that exhibit high intra-consistency yet sharp inter-contrast across antithetical generative responses such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness with over 90 percent of outputs reverting to biased content, far exceeding the bias levels induced by explicit adversarial jailbreak attacks. We further propose two training-free editing strategies, Local Enhancement and Networked Enhancement, that improve robustness against jailbreaks and performance on safety benchmarks while preserving generative proficiency.

What carries the argument

COCO neurons: units isolated by the contrastive causal method that maintain high consistency within stereotypical or unbiased outputs but show sharp differences between the two.
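
One way to read that selection criterion is as a variance-normalized contrast between class-wise activation statistics. The paper's exact scoring rule is not reproduced in this review, so the Python sketch below is only an illustration under stated assumptions: the function names, the activation-response inputs, and the top-k cutoff are invented here, not the authors' implementation.

```python
import numpy as np

def coco_score(acts_stereo: np.ndarray, acts_unbiased: np.ndarray) -> np.ndarray:
    """Score each neuron by inter-contrast over intra-consistency.

    acts_stereo, acts_unbiased: (n_prompts, n_neurons) activation responses
    collected on stereotypical vs. unbiased generations. A neuron scores high
    when its responses are stable within each class (low variance) but the
    two class means are far apart.
    """
    intra = acts_stereo.var(axis=0) + acts_unbiased.var(axis=0) + 1e-8
    inter = np.abs(acts_stereo.mean(axis=0) - acts_unbiased.mean(axis=0))
    return inter / np.sqrt(intra)

def select_coco_neurons(acts_stereo, acts_unbiased, top_k=100):
    """Return indices of the top_k candidate COCO-like neurons."""
    scores = coco_score(acts_stereo, acts_unbiased)
    return np.argsort(scores)[::-1][:top_k]
```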

If this is right

  • Deactivating COCO neurons produces over 90 percent biased outputs, exceeding bias from explicit jailbreak attacks.
  • Simple amplification of COCO neuron weights yields only marginal fairness gains.
  • Local Enhancement and Networked Enhancement editing methods increase resistance to adversarial jailbreaks (a sketch of this style of weight edit follows this list).
  • The edited models retain strong results on open-ended safety benchmarks and core generation tasks.
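
The two enhancement edits named above are training-free weight edits, but their precise form (LE-COCO, NE-COCO) is not spelled out in this review. The snippet below therefore sketches only the simpler baseline the abstract mentions, uniform amplification of the selected neurons' output weights, which the paper reports is by itself only marginally effective. The tensor layout and names are assumptions for illustration.

```python
import torch

@torch.no_grad()
def amplify_neurons(down_proj_weight: torch.Tensor, neuron_ids, gain: float = 1.5) -> torch.Tensor:
    """Scale the contribution of selected feed-forward neurons in one MLP block.

    down_proj_weight: (hidden_dim, intermediate_dim) matrix whose column j
    carries neuron j's output back into the residual stream. Multiplying those
    columns by `gain` amplifies the selected neurons without any retraining.
    """
    down_proj_weight[:, list(neuron_ids)] *= gain
    return down_proj_weight
```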

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same contrastive isolation approach could be applied to locate neurons involved in detecting hallucinations or maintaining logical consistency.
  • If COCO-like neurons exist across different model architectures, targeted editing might offer a general route to strengthening internal safety checks without full retraining.

Load-bearing premise

The contrastive causal method isolates neurons whose activity actually causes the suppression of stereotypes instead of merely coinciding with the difference in outputs.

What would settle it

An ablation experiment in which deactivating the identified COCO neurons fails to produce a sharp rise in biased outputs or in which random sets of neurons produce comparable increases when deactivated.
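
A minimal version of that settling experiment is a size-matched random-control comparison, sketched below under stated assumptions: `eval_bias_rate` is a hypothetical harness, not part of the paper, that regenerates the bias prompt suite with the given neurons zeroed and returns the fraction of biased completions.

```python
import random

def ablation_control_study(model, coco_ids, all_ids, eval_bias_rate,
                           n_controls=5, seed=0):
    """Compare COCO-neuron ablation against size-matched random controls.

    The causal reading survives only if the bias rate after COCO ablation
    far exceeds the rate after ablating every random control set.
    """
    rng = random.Random(seed)
    coco_rate = eval_bias_rate(model, coco_ids)
    control_rates = [
        eval_bias_rate(model, rng.sample(list(all_ids), len(coco_ids)))
        for _ in range(n_controls)
    ]
    return coco_rate, control_rates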

Figures

Figures reproduced from arXiv: 2605.09647 by Bo Wang, Dongming Zhao, Jingshen Zhang, Ruifang He, Yanlin Fu, Yuexian Hou, Zifei Yu.

Figure 1. Targeted deactivation experiments. A lower value corresponds to a diminished ability to …
Figure 2. Comparison between stimulus-driven traditional safety mechanisms and process-oriented self-debiasing mechanisms. Traditional safety mechanisms can refuse to respond when detecting high-risk keywords like "steal a car", as shown in (a); however, they can be easily bypassed by natural prompt attacks through semantic obfuscation, as shown in (b). We analyze emergent self-debiasing mechanisms that do not rely…
Figure 3. COCO Neuron Extraction. Quantify Neuron Activation Response: as discussed in Section 2, given a neuron $N^{l,j}_w$ and an input query $x$, the hidden state after the $l$-th layer when handling $x$ is denoted $h^l(x)$. Following Zhao et al. (2025b), the activation response of neuron $N^{l,j}_w$ in processing $x$ is $a^{l,j}_w = \lVert h^l_{\setminus N^{l,j}_w}(x) - h^l(x) \rVert_2$, where $h^l_{\setminus N^{l,j}_w}$ …
Figure 4. Experimental results for the enhancing editing of LE-COCO and NE-COCO. Higher …
Figure 5. Results of safety jailbreak testing. Higher values denote stronger resistance. …
Figure 6. Safety Jailbreak Prompt Templates used in our work.
Figure 7. The distribution heatmap of LE-COCO neurons in Llama3-8B (left) and NE-COCO …
Figure 8. Shifts in the attention score matrices following enhancement. Top rows: Top 3 attention …
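
The activation-response definition quoted in the Figure 3 caption reduces to an L2 norm between two hidden states: one from a normal forward pass and one with the candidate neuron silenced. The PyTorch sketch below computes just that quantity; how the silenced pass is produced (forward hooks, weight zeroing) is not specified here and is left out of the sketch.

```python
import torch

def activation_response(h_full: torch.Tensor, h_without_neuron: torch.Tensor) -> torch.Tensor:
    """L2 shift of the layer-l hidden state when one neuron is removed,
    i.e. the norm of h^l_without_neuron(x) - h^l(x) from Figure 3.

    h_full:           hidden state h^l(x) from the unmodified forward pass
    h_without_neuron: hidden state from the same pass with the neuron zeroed
    Both are (..., hidden_dim); the norm is taken over the last dimension.
    """
    return torch.linalg.vector_norm(h_without_neuron - h_full, ord=2, dim=-1)
```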
read the original abstract

In this paper, we study an emergent self-debiasing mechanisms against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms that are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve generation-time intrinsic correction that are not directly reducible to surface-level prompt. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-COnsistency yet sharp inter-COntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness; over 90% of outputs revert to biased content, far exceeding the bias levels induced by explicit adversarial jailbreak attacks. Observing that simple weight amplification of COCO neurons yields only marginal gains, we propose two training-free, lightweight editing strategies: Local Enhancement (LE-COCO) and Networked Enhancement (NE-COCO). Comprehensive evaluations show that our methods bolster robustness against adversarial jailbreaks and achieve strong performance on open-ended safety benchmarks, while preserving foundational generative proficiency. While this study primarily addresses social stereotypes, the COCO mechanism holds significant potential for diverse domains like hallucination detection, offering valuable insights toward the development of self-evolving AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes COCO, a contrastive causal method to identify 'COCO neurons' in LLMs that show high intra-consistency within stereotypical or unbiased generations but large inter-contrast between them. Motivated by cognitive neuroscience accounts of conflict monitoring and response inhibition, the authors claim that ablating these neurons causes over 90% of outputs to revert to biased content—exceeding effects from explicit jailbreak attacks. They introduce two training-free editing strategies (LE-COCO and NE-COCO) that improve robustness on safety benchmarks while preserving generative performance, with potential extension to hallucination detection.

Significance. If the COCO neurons can be shown to implement a specific implicit conflict-monitoring function rather than simply encoding one response class, the work would advance mechanistic understanding of emergent self-debiasing in LLMs and supply practical, lightweight editing techniques. The training-free nature and reported gains over jailbreaks are potentially useful, but the current evidence does not yet establish the functional interpretation or the claimed superiority.

major comments (3)
  1. [Abstract] The central claim that ablating COCO neurons produces 'catastrophic collapse' with >90% reversion to biased outputs is presented without any description of how the 90% figure was computed, the evaluation dataset size or composition, statistical significance tests, or controls for neuron selection criteria. This information is load-bearing for the causal interpretation of conflict monitoring.
  2. [Abstract] The contrastive selection criterion (high intra-consistency within each class but sharp inter-contrast) selects neurons that differentiate stereotypical from unbiased outputs by construction. Ablation is therefore expected to shift the output distribution toward the alternative class; this outcome does not require or demonstrate a dedicated conflict-detection or inhibition mechanism.
  3. [Abstract] No activation-timing analysis, controlled intervention experiments, or other evidence is supplied to show that the selected neurons activate preferentially upon internal detection of a stereotype conflict rather than simply participating in unbiased continuation generation. The neuroscience analogy therefore rests on an untested functional interpretation.
minor comments (2)
  1. [Abstract] The acronym COCO is expanded via underlined letters in the abstract but the full phrase is not stated explicitly, which may hinder readability.
  2. Quantitative results, dataset descriptions, and ablation controls are summarized at a high level; the manuscript would be strengthened by including these details in the main text or a dedicated methods/results section.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating planned revisions where the manuscript can be strengthened without misrepresenting our findings.

read point-by-point responses
  1. Referee: The central claim that ablating COCO neurons produces 'catastrophic collapse' with >90% reversion to biased outputs is presented without any description of how the 90% figure was computed, the evaluation dataset size or composition, statistical significance tests, or controls for neuron selection criteria. This information is load-bearing for the causal interpretation of conflict monitoring.

    Authors: We agree that the abstract would benefit from additional methodological context. The full manuscript details the computation of the reversion rate, the prompt dataset used for evaluation, and associated statistical analyses in the experimental results section. We will revise the abstract to incorporate a concise summary of these elements to support the causal claims. revision: yes

  2. Referee: The contrastive selection criterion (high intra-consistency within each class but sharp inter-contrast between them) selects neurons that differentiate stereotypical from unbiased outputs by construction. Ablation is therefore expected to shift the output distribution toward the alternative class; this outcome does not require or demonstrate a dedicated conflict-detection or inhibition mechanism.

    Authors: The selection criterion is contrastive by design, yet the manuscript shows that ablation of these specific neurons produces a collapse exceeding that from explicit jailbreak attacks, while their enhancement yields targeted robustness improvements not achieved by generic weight scaling. To address the concern, we will add controls ablating neurons selected under alternative criteria in the revised manuscript to demonstrate the specificity of the intra-consistency and inter-contrast properties. revision: partial

  3. Referee: No activation-timing analysis, controlled intervention experiments, or other evidence is supplied to show that the selected neurons activate preferentially upon internal detection of a stereotype conflict rather than simply participating in unbiased continuation generation. The neuroscience analogy therefore rests on an untested functional interpretation.

    Authors: The ablation and enhancement interventions constitute controlled experiments establishing the causal necessity and sufficiency of the identified neurons for maintaining unbiased generation. We acknowledge that activation timing analysis is not included and will expand the discussion to clarify the scope of the functional interpretation and the limits of the neuroscience analogy based on the available causal evidence. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes an empirical contrastive method (COCO) for neuron identification based on activation consistency and contrast across output classes, followed by ablation experiments and editing strategies as validation steps. No equations, parameter fits, or self-citations are described that reduce any central claim (such as the ablation outcome or mechanism interpretation) to a definitional equivalence or by-construction result. The work presents standard experimental findings without load-bearing self-referential derivations; its central claims rest on external benchmarks rather than on by-construction reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the introduction of COCO neurons themselves.

invented entities (1)
  • COCO neurons: no independent evidence
    purpose: Neurons exhibiting high intra-consistency yet sharp inter-contrast across antithetical generative responses (stereotypical vs. unbiased)
    Introduced as the central object of study; no independent evidence or prior citation is provided in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1264 out tokens · 52562 ms · 2026-05-12T02:28:34.818746+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
