ToxSyn-PT: A Synthetic Fine-Grained Dataset of Minority-Targeted Toxic Language in Portuguese
The development of robust hate speech detection systems remains limited by the lack of large-scale, fine-grained training data, especially for languages other than English. Existing corpora typically rely on simplistic toxic/non-toxic labels, and the few that capture hate directed at specific minority groups lack the positive counterexamples required to distinguish genuine hate from mere discussion. In this work, we introduce ToxSyn-PT, the first large-scale Portuguese corpus explicitly designed for multi-label hate speech detection across nine protected minority groups; it includes the non-toxic counterexamples absent from all other public datasets. Generated via a controllable four-stage pipeline, ToxSyn also contains discourse-type annotations that capture the rhetorical strategies of toxic and non-toxic language, such as sarcasm, dehumanization, and cultural appreciation. Our experiments reveal a catastrophic, mutual generalization failure between ToxSyn and existing social-media datasets: models trained on social media struggle to generalize to minority-specific contexts, and vice versa. This finding indicates that the two are distinct tasks and shows that summary metrics such as Macro F1 can be unreliable indicators of true model behavior, as they can completely mask model failure. We publicly release ToxSyn on HuggingFace to support reproducible research on synthetic data generation and to benchmark progress in hate speech detection for low- and mid-resource languages.
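As a toy illustration of how an averaged score can hide a total failure, the sketch below computes per-label F1 and their unweighted mean (Macro F1) in plain Python. It is not taken from the paper: the four labels, the gold annotations, and the predictions are invented, and the helper names (`f1`, `per_label_f1`, `macro_f1`) are hypothetical. The model never predicts label 2, so that label's F1 is 0.0, yet the Macro F1 is still 0.75.

```python
from typing import List, Sequence

# Hypothetical multi-label data: 4 examples x 4 labels (1 = label applies).
Y_TRUE = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
# Hypothetical predictions: perfect on labels 0, 1, 3; label 2 is never predicted.
Y_PRED = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 = 2TP / (2TP + FP + FN); returns 0.0 if the denominator is 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_label_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> List[float]:
    """F1 computed independently for each label (column)."""
    n_labels = len(y_true[0])
    return [
        f1([row[j] for row in y_true], [row[j] for row in y_pred])
        for j in range(n_labels)
    ]


def macro_f1(scores: Sequence[float]) -> float:
    """Unweighted mean of per-label F1 scores."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    scores = per_label_f1(Y_TRUE, Y_PRED)
    print("per-label F1:", scores)           # [1.0, 1.0, 0.0, 1.0]
    print("Macro F1:", macro_f1(scores))     # 0.75
    print("worst-label F1:", min(scores))    # 0.0
```

Reporting per-label (or worst-label) scores next to the average is one simple way to surface this kind of failure; the paper's own evaluation protocol may differ.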
Forward citations
Cited by 1 Pith paper
- Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
Safety alignment in LLMs is not uniform but forms a demographic hierarchy, with defense rates varying by up to 42% across groups; a new benchmark and DPO method demonstrate transferable safety.