DIA-HARM: Dialectal Disparities in Harmful Content Detection Across 50 English Dialects

Adaku Uchendu; Ali Al-Lawati; Dongwon Lee; Jason Lucas; Matt Murtagh; Uchendu Uchendu

arxiv: 2604.05318 · v2 · pith:VQMUK4UInew · submitted 2026-04-07 · 💻 cs.CL

Jason Lucas, Matt Murtagh, Ali Al-Lawati, Uchendu Uchendu, Adaku Uchendu, Dongwon Lee

Pith reviewed 2026-05-10 19:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords disinformation detection · dialectal variation · English dialects · model robustness · harmful content · benchmark · multilingual models · content moderation

The pith

Disinformation detectors show reduced performance on non-Standard American English dialects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark to check how disinformation detectors handle English written in 50 different dialects instead of only Standard American English. It converts existing test data into dialect versions using systematic language rules and runs the new samples through 16 different models. Human-written dialect versions cause measurable drops in detection quality while AI-written versions do not, and some models lose more than a third of their accuracy on mixed inputs. Multilingual models maintain higher scores across dialects than models trained only on English. If the pattern holds, the tools used to flag harmful content may work less reliably for speakers of many English varieties.

Core claim

Evaluations using the DIA-HARM benchmark and its D-CUBE corpus (abbreviated D3) of 195K dialectal samples show that human-written dialectal content degrades F1 by 1.4-3.6 percent across 16 models while AI-generated dialectal content stays stable, with some models exhibiting over 33 percent degradation on mixed content. Fine-tuned transformers reach a best-case F1 of 96.6 percent versus 78.3 percent for zero-shot LLMs, and cross-dialect analysis of 2,450 dialect pairs finds that multilingual models such as mDeBERTa average 97.2 percent F1 while monolingual models like RoBERTa fail on dialectal inputs.
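As a quick arithmetic check, the figure of 2,450 pairs is what one gets by counting ordered pairs of distinct dialects drawn from 50 (50 × 49). The sketch below only illustrates that count; the dialect names are hypothetical placeholders, and the reading of a pair as an ordered (source, target) combination is an assumption, not something the page states.

```python
from itertools import permutations

NUM_DIALECTS = 50  # number of dialects reported in the abstract

# Hypothetical placeholder names; the real dialect labels are not listed on this page.
dialects = [f"dialect_{i:02d}" for i in range(NUM_DIALECTS)]

# Ordered (source, target) pairs of distinct dialects (assumed interpretation).
pairs = list(permutations(dialects, 2))

print(len(pairs))  # 2450
```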

What carries the argument

The DIA-HARM benchmark applies linguistically grounded transformations to create 50 English dialect variants of disinformation samples for testing detection robustness.

If this is right

Fine-tuned transformers substantially outperform zero-shot LLMs on dialectal disinformation inputs.
Multilingual models generalize across dialects far better than monolingual models such as RoBERTa.
Human-written dialect content triggers larger performance losses than AI-generated dialect content.
Current detectors may produce unequal results for hundreds of millions of non-Standard American English speakers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Moderation pipelines may need explicit dialect coverage during training to reduce uneven error rates.
The observed stability on AI-generated text suggests detectors may rely on cues that differ between synthetic and natural language.
Similar robustness gaps could appear in related tasks such as hate-speech or toxicity detection.
Global deployment of these models would benefit from routine testing on authentic regional English data.

Load-bearing premise

The transformed dialect samples accurately match real-world usage and keep the original disinformation label without adding separate changes that alter model behavior.

What would settle it

Running the same 16 models on naturally collected disinformation examples written in the target dialects and finding no drop in detection scores relative to Standard American English.

Figures

Figures reproduced from arXiv: 2604.05318 by Adaku Uchendu, Ali Al-Lawati, Dongwon Lee, Jason Lucas, Matt Murtagh, Uchendu Uchendu.

**Figure 2.** The DIA-HARM evaluation framework. Starting from 9 SAE disinformation benchmarks, we apply Multi-VALUE rule-based dialect transformations to generate 50 English dialectal variants. D-PURIFY validates transformation quality using semantic, logical, and feature accuracy metrics. We then evaluate 16 detectors across multiple experimental settings (SQ1–SQ4), measuring classification robustness under unseen, se…

**Figure 3.** SQ1: Generalization gap (∆ F1) from SAE to dialectal variants by content type. Solid blue = human content; hatched green = AI content; dotted orange = mixed content. Negative values indicate degradation on dialects.

**Figure 4.** Asymmetric harm across evaluation regimes.

**Figure 5.** Per-model asymmetric harm across all evaluation regimes. Green dots show…
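The generalization gap in the Figure 3 caption can be read as F1 on the dialectal variant minus F1 on the SAE original, so that negative values mean degradation. The following is a minimal sketch under that assumption; it uses binary F1 on the positive class (the paper's averaging scheme is not stated on this page), and the labels and predictions are made-up toy data, not results from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def generalization_gap(y_true, pred_sae, pred_dialect):
    """Delta F1 = F1(dialect) - F1(SAE); negative means the detector degrades on the dialect."""
    return f1_score(y_true, pred_dialect) - f1_score(y_true, pred_sae)


# Toy data (hypothetical): the labels are shared because a dialect transformation
# is assumed to preserve the original disinformation label.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
pred_sae = [1, 1, 1, 1, 0, 0, 0, 1]      # predictions on the SAE text
pred_dialect = [1, 1, 0, 0, 0, 0, 0, 1]  # predictions on the dialectal variant

gap = generalization_gap(y_true, pred_sae, pred_dialect)
print(f"{gap:+.3f}")  # -0.317
```

Sharing `y_true` between the two conditions encodes the label-preservation premise directly, which is the same premise the review flags as load-bearing.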

Original abstract

Harmful content detectors, particularly disinformation classifiers, are predominantly developed and evaluated on Standard American English (SAE), leaving their robustness to dialectal variation unexplored. We present DIA-HARM, the first benchmark for evaluating disinformation detection robustness across 50 English dialects spanning U.S., British, African, Caribbean, and Asia-Pacific varieties. Using Multi-VALUE's linguistically grounded transformations, we introduce D-CUBE (Dialectal Disinformation Detection Corpus), a core corpus component of DIA-HARM comprising 195K samples derived from established disinformation benchmarks. Our evaluation of 16 detection models reveals systematic vulnerabilities: human-written dialectal content degrades detection by 1.4-3.6% F1, while AI-generated content remains stable. Fine-tuned transformers substantially outperform zero-shot LLMs (96.6% vs. 78.3% best-case F1), with some models exhibiting catastrophic failures exceeding 33% degradation on mixed content. Cross-dialectal transfer analysis across 2,450 dialect pairs shows that multilingual models (mDeBERTa: 97.2% average F1) generalize effectively, while monolingual models like RoBERTa and XLM-RoBERTa fail on dialectal inputs. These findings demonstrate that current disinformation detectors may systematically disadvantage hundreds of millions of non-SAE speakers worldwide. We release the DIA-HARM benchmark, including the D-CUBE corpus (https://github.com/jsl5710/dia-harm), and evaluation tools (https://jsl5710.github.io/dia-harm).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DIA-HARM offers a broad new benchmark on dialectal disinformation detection but leans too much on unvalidated synthetic transformations.

The main takeaway is that this paper creates the first benchmark for disinformation detection across 50 English dialects and reports performance gaps on transformed data, but those gaps may not reflect real dialectal issues. The work introduces DIA-HARM and the D3 corpus of 195K samples derived via Multi-VALUE from prior benchmarks. It evaluates 16 models, finds bigger drops on human-written dialect content than on AI-generated content, and shows that multilingual models transfer better across 2,450 dialect pairs.

Releasing the code and data is practical and lets the community check the details. The evaluations provide specific numbers on F1 degradation and model comparisons that were not previously available at this scale of dialect coverage. Covering varieties from multiple regions adds breadth.

The soft spot sits in the data generation. The transformations are rule- or model-based, yet the paper does not appear to include checks such as human judgments on how natural the text sounds or whether the disinformation label holds after the changes. If the output introduces unnatural patterns or shifts meaning, the observed drops of a few percent or more could stem from that rather than from a lack of dialect robustness. The broader claim about disadvantaging non-standard English speakers therefore rests on an assumption that needs testing against actual dialect data.

This paper suits researchers focused on fairness in NLP applications like content moderation. Someone looking for a new testbed or ideas on cross-dialect evaluation will find the resources valuable. It deserves peer review because the benchmark and release are concrete steps forward, even if the interpretation of results requires more support. I would recommend sending it to referees.

Referee Report

2 major / 2 minor

Summary. The paper introduces DIA-HARM, the first benchmark for disinformation detection robustness across 50 English dialects (U.S., British, African, Caribbean, Asia-Pacific). It constructs the D3 corpus (195K samples) by applying Multi-VALUE linguistically grounded transformations to existing disinformation benchmarks. Evaluation of 16 models (fine-tuned transformers and zero-shot LLMs) reports F1 degradations of 1.4-3.6% on human-written dialectal content (with some mixed cases >33%), better performance from multilingual models (e.g., mDeBERTa at 97.2% average F1), and cross-dialectal transfer results over 2,450 pairs. The authors conclude that current detectors may systematically disadvantage non-SAE speakers and release the framework, D3 corpus, and tools.

Significance. If the D3 corpus validly represents real-world dialectal disinformation, the work identifies a practically important robustness gap in harmful-content detection systems that could affect hundreds of millions of speakers. The empirical scale (50 dialects, 16 models, large corpus), release of code/data, and cross-dialect transfer analysis are strengths that would support follow-on research in fairness and multilingual NLP.

major comments (2)

[D3 corpus construction] Corpus construction / D3 creation section: The central claim that detectors 'systematically disadvantage' non-SAE speakers rests on performance drops observed after Multi-VALUE transformations. No quantitative validation (human naturalness ratings, semantic equivalence checks against authentic dialect corpora, or label-preservation verification) is reported for the 50 dialects. If transformations introduce unnatural phrasing or alter surface cues that models rely on, the measured F1 gaps (1.4-3.6% and >33%) may reflect artifacts rather than dialectal robustness failure.
[Evaluation of 16 detection models] Evaluation and results section: The abstract and results distinguish human-written vs. AI-generated content and report specific degradation numbers, but provide no statistical tests (e.g., significance of F1 differences, confidence intervals, or controls for transformation-induced label drift) to support that the observed gaps are attributable to dialect rather than other factors. This weakens the load-bearing inference to real-world disadvantage.

minor comments (2)

[Abstract] Abstract: The phrasing 'first benchmark' should be qualified with citations to prior dialectal robustness studies in related tasks (e.g., sentiment, toxicity) to avoid overstatement.
[Results tables] Table/figure captions: Ensure all tables reporting F1 scores include the exact number of samples per dialect category and the baseline SAE performance for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions to strengthen the presentation of our results while preserving the core contributions of the DIA-HARM benchmark.

Referee: [D3 corpus construction] Corpus construction / D3 creation section: The central claim that detectors 'systematically disadvantage' non-SAE speakers rests on performance drops observed after Multi-VALUE transformations. No quantitative validation (human naturalness ratings, semantic equivalence checks against authentic dialect corpora, or label-preservation verification) is reported for the 50 dialects. If transformations introduce unnatural phrasing or alter surface cues that models rely on, the measured F1 gaps (1.4-3.6% and >33%) may reflect artifacts rather than dialectal robustness failure.

Authors: We appreciate the referee's emphasis on corpus validity. The D3 corpus relies on Multi-VALUE transformations, which were previously validated in the source work for linguistic fidelity, naturalness, and semantic preservation across English dialects through expert linguistic review and human judgments. We did not replicate new human evaluations here to focus on the downstream detection task, but we will add explicit citations to those prior validations, a dedicated paragraph discussing their scope, and a brief acknowledgment that our results inherit the strengths and limitations of the transformation framework. Label preservation follows from the design of the transformations (surface-form changes that retain propositional content), and we will note this explicitly. We agree that fresh verification would be ideal and will include it as a limitation if space allows.

revision: partial

Referee: [Evaluation of 16 detection models] Evaluation and results section: The abstract and results distinguish human-written vs. AI-generated content and report specific degradation numbers, but provide no statistical tests (e.g., significance of F1 differences, confidence intervals, or controls for transformation-induced label drift) to support that the observed gaps are attributable to dialect rather than other factors. This weakens the load-bearing inference to real-world disadvantage.

Authors: We agree that statistical support would strengthen the claims. In the revised manuscript we will add (1) bootstrap-derived 95% confidence intervals for all reported F1 scores, (2) paired statistical tests (Wilcoxon signed-rank) comparing original vs. dialectal performance per model and dialect group, and (3) a small-scale manual audit of 200 transformed samples to quantify any label drift introduced by the transformations. These additions will be placed in the evaluation section and will directly address whether the observed gaps exceed what could be expected from sampling variation or transformation artifacts.

revision: yes
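To make the first promised addition concrete, here is a minimal sketch of a percentile bootstrap 95% confidence interval for F1. It is an illustration under stated assumptions, not the authors' procedure: the resample count, the seed, the binary-F1 definition, and the synthetic labels and predictions are all hypothetical choices. The paired Wilcoxon signed-rank test the rebuttal mentions could be run separately, for example with `scipy.stats.wilcoxon` on per-dialect F1 differences.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_f1_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample examples with replacement and recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic example (hypothetical data): a detector that is right about 90% of the time.
data_rng = random.Random(1)
y_true = [data_rng.randint(0, 1) for _ in range(200)]
y_pred = [t if data_rng.random() < 0.9 else 1 - t for t in y_true]

point = f1_score(y_true, y_pred)
lo, hi = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

If the SAE and dialectal intervals for a model overlap heavily, a gap of a few F1 points may not be distinguishable from sampling variation, which is the concern the referee raises.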

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with external dependencies

The paper constructs the D3 corpus by applying Multi-VALUE transformations (an external prior method) to established disinformation benchmarks and then measures performance of 16 detection models across dialects. No mathematical derivations, equations, or 'predictions' are present that reduce by construction to fitted parameters or self-referential definitions. Central claims rest on observed F1 scores and cross-dialect transfer metrics, which are directly falsifiable via independent replication on the released corpus rather than being forced by internal definitions or self-citation chains. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of the Multi-VALUE dialect transformations preserving semantic content and labels, plus the assumption that the chosen 50 dialects and 16 models are representative for the stated conclusions.

axioms (1)

domain assumption: Linguistically grounded transformations from Multi-VALUE produce valid dialectal variants that preserve the original disinformation label.
Invoked when creating the D3 corpus from established benchmarks.

pith-pipeline@v0.9.0 · 5576 in / 1203 out tokens · 55200 ms · 2026-05-10T19:54:06.000936+00:00 · methodology
