The Risk of Racial Bias in Hate Speech Detection
8 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (8)
Citation roles: background (1)
Citation polarities: background (1)
citing papers explorer
- CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
  Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
- Is She Even Relevant? When BERT Ignores Explicit Gender Cues
  A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
- Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
  LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
- CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
  CoGate-LSTM adds prototype-guided cosine feature-space gating to a character-level BiLSTM with multi-source embeddings and focal loss, reaching 0.881 macro-F1 on Jigsaw toxic comments while using 7.3M parameters and outperforming fine-tuned BERT by 6.9 points on minority labels.
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
- Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models
  Toxicity in language models is disproportionately encoded in early MLP layers and can be localized via activation differentials, then suppressed at inference time without gradient descent.
- IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language
  Automated hate speech detectors show poor alignment with heterogeneous in-group judgments on reclaimed slur usage, driven by low inter-annotator agreement and contextual features like derogatory intent.
- A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models
  A survey that catalogs threat models, detection approaches, and mitigation strategies for toxicity in multilingual LLMs while identifying challenges such as uneven language coverage and culturally variable harm definitions.