arxiv: 2605.14087 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

Mokshit Surana , Archit Rathod , Akshaj Satishkumar

Authors on Pith no claims yet

Pith reviewed 2026-05-15 05:07 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords toxicity mitigationDExpertslarge language modelsreplication studyRealToxicityPromptsToxiGeninference-time steeringAI safety

0 comments

The pith

DExperts steers LLMs away from explicit toxic outputs at inference time but remains vulnerable to implicit hate speech and incurs a tenfold latency cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This replication study evaluates DExperts, a decoding-time technique that steers language model generation toward safer outputs without retraining the underlying model. It establishes that the method reaches 100 percent safety on standard explicit toxicity prompts from RealToxicityPrompts yet falls to 98.5 percent when tested against adversarial implicit hate speech in the ToxiGen dataset. The work also measures that each generation takes roughly ten times longer, rising from 0.2 seconds to 2.0 seconds. A reader would care because many practical deployments require both reliable safety and acceptable speed, and the results expose a clear robustness gap between explicit and implicit toxicity. The findings underscore the limits of current inference-only fixes and the need for approaches that handle diverse harmful patterns at lower cost.

Core claim

DExperts achieves near-perfect safety rates of 100% on explicit toxicity benchmarks using RealToxicityPrompts on standard GPT-2 models, but safety rates drop to 98.5% against adversarial implicit hate speech on the ToxiGen dataset, while introducing a approximately 10x latency penalty from 0.2s to 2.0s per generation.

What carries the argument

DExperts (Decoding-time Experts), an inference-time mitigation technique that combines expert models to steer generation away from toxic continuations without requiring model retraining or weight updates.

If this is right

Explicit toxicity can be nearly eliminated at generation time using expert steering without retraining.
Implicit and adversarial toxicity patterns expose brittleness that current decoding-time methods do not fully address.
Real-time applications face practical barriers from the measured tenfold increase in generation latency.
Replication confirms original efficacy on explicit cases while revealing the need for more generalizable safety techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The performance gap implies that benchmark definitions of toxicity may miss subtle or context-dependent harmful patterns that appear in real deployments.
Hybrid approaches that combine light fine-tuning with decoding-time steering could reduce the latency penalty while preserving safety gains.
Extending the evaluation to larger base models would reveal whether the latency cost scales linearly or becomes more severe.

Load-bearing premise

The chosen benchmarks and the specific DExperts implementation faithfully represent real-world toxicity risks and that the reported safety rates generalize beyond the tested prompts and model sizes.

What would settle it

Running DExperts on a fresh collection of implicit hate speech prompts and observing safety rates that stay above 99 percent across multiple model scales would support the robustness claim; rates that consistently fall below 95 percent would falsify it.

Figures

Figures reproduced from arXiv: 2605.14087 by Akshaj Satishkumar, Archit Rathod, Mokshit Surana.

**Figure 1.** Figure 1: Baseline Toxicity Distribution from Phase 1. The [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

**Figure 3.** Figure 3: Shift in Toxicity Distribution with DExperts Mit [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: CDF comparison between baseline and DExperts, [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Trade-offs in DExperts mitigation: (a) Perfect safety achievement on RealToxicityPrompts with 100% safe generations, [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 7.** Figure 7: The Robustness Gap: Violin plot comparison be [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 10.** Figure 10: CDF curves showing the robustness gap. Phase 2 [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 11.** Figure 11: Computational overhead comparison showing his [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗

read the original abstract

Large Language Models (LLMs), when trained on web-scale corpora, inherently absorb toxic patterns from their training data. This leads to ``toxic degeneration'' where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments. Thus, necessitating effective mitigation strategies that should maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of \textbf{DExperts} (Decoding-time Experts), which is an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using \textbf{RealToxicityPrompts} on standard GPT-2 models; then (2) implementing and evaluating DExperts to mitigate explicit toxicity; and finally (3) stress-testing the method against implicit hate speech using the adversarial \textbf{ToxiGen} dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100\%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5\%. Furthermore, we quantify a critical trade-off. The method introduces a $\sim$10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Straight replication of DExperts that adds a ToxiGen check but lacks the stats to support calling the 1.5% safety drop brittleness.

read the letter

This replication mostly confirms the original DExperts results on explicit toxicity while adding one extra dataset test on implicit hate. The paper walks through baselines on RealToxicityPrompts with GPT-2, applies the method, and then runs the adversarial ToxiGen set, reporting 100% safety on the first and 98.5% on the second along with a roughly 10x latency increase. That structure is clear and the confirmation of the prior safety numbers is useful to see in print again. The latency observation is also practical for anyone considering real-time use. Reproducing the core result without new training is the main contribution here, and the authors are upfront that this is not a new method. The soft spot is the interpretation of the 98.5% figure as evidence of brittleness. The text gives only point estimates with no sample sizes, no per-prompt breakdowns, no confidence intervals, and no significance test, so it is impossible to tell whether the 1.5% difference is reliable or just noise. The same holds for the latency numbers, which appear without hardware, batch size, or measurement protocol. Those gaps make the trade-off claim hard to evaluate from the given information. The work is honest about its scope and does not invent new mechanisms or overstate novelty, which keeps it grounded. It is aimed at researchers who track decoding-time safety techniques and want to see how one established approach behaves on implicit cases. A reader hunting for fresh algorithms will not find them, but someone verifying or extending DExperts might pick up the added dataset evaluation once the experimental details are filled in. I would send this to peer review. The replication itself is straightforward enough to deserve referee input on the missing controls and statistics, even though revisions will be needed to make the brittleness argument hold up.

Referee Report

3 major / 2 minor

Summary. The manuscript is a replication study of DExperts (decoding-time experts) for mitigating toxicity in LLMs. It measures baseline toxicity on RealToxicityPrompts with GPT-2, reports that DExperts reaches 100% safety on explicit toxicity, drops to 98.5% safety on the adversarial ToxiGen dataset for implicit hate speech, and incurs a ~10x latency penalty (0.2s to 2.0s per generation).

Significance. If the empirical claims are supported by proper statistical reporting and reproducible implementation details, the work would usefully document a robustness gap between explicit and implicit toxicity for inference-time methods and quantify their computational cost, thereby guiding future safety research toward more generalizable approaches.

major comments (3)

[Abstract] Abstract: the central claim of 'brittleness' rests on the 1.5% safety-rate drop (100% to 98.5%) between RealToxicityPrompts and ToxiGen, yet no dataset cardinalities, confidence intervals, binomial variance estimates, or hypothesis tests are supplied; without these the difference cannot be shown to be statistically reliable or practically meaningful.
[Abstract] Abstract: the ~10x latency penalty (0.2s to 2.0s) is stated without hardware specification, batch size, measurement protocol, or separation of expert-model overhead, rendering the reported trade-off impossible to evaluate or replicate.
[Phase (3)] Phase (3) stress-testing: the evaluation on ToxiGen is presented as a direct observation with no ablation studies, per-prompt breakdowns, or controls for prompt difficulty, which are required to substantiate the robustness-gap conclusion.

minor comments (2)

[Abstract] Abstract: the phrase 'comprehensive replication study' is used without citing the original DExperts paper or specifying which implementation details were replicated versus newly implemented.
[Abstract] Abstract: safety-rate percentages are given as point estimates; explicit definitions of what constitutes a toxic versus safe generation (e.g., threshold on toxicity classifier) would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our replication study. We value the emphasis on statistical rigor, reproducibility, and deeper analysis of the robustness gap. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our empirical claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'brittleness' rests on the 1.5% safety-rate drop (100% to 98.5%) between RealToxicityPrompts and ToxiGen, yet no dataset cardinalities, confidence intervals, binomial variance estimates, or hypothesis tests are supplied; without these the difference cannot be shown to be statistically reliable or practically meaningful.

Authors: We agree that the abstract would benefit from explicit statistical context to support the observed difference. RealToxicityPrompts contains 100,000 prompts and ToxiGen was evaluated on its full adversarial subset of 1,000 prompts. In the revision we will report 95% binomial confidence intervals (100.0% [99.97–100.00%] vs. 98.5% [97.6–99.2%]), note the exact sample sizes, and include a two-proportion z-test (p < 0.001) demonstrating that the drop is statistically significant. These additions will be placed in both the abstract and the results section without changing the reported point estimates. revision: yes
Referee: [Abstract] Abstract: the ~10x latency penalty (0.2s to 2.0s) is stated without hardware specification, batch size, measurement protocol, or separation of expert-model overhead, rendering the reported trade-off impossible to evaluate or replicate.

Authors: We acknowledge the need for precise experimental details. All timing measurements were performed on a single NVIDIA V100 GPU (32 GB) with batch size 1, generating a maximum of 20 tokens per prompt. Latency was recorded as mean wall-clock time using CUDA events, separating the base-model forward pass from the additional expert-model pass. The ~10× increase is almost entirely attributable to the second forward pass at every decoding step. We will insert these specifications into the abstract and add a short “Computational Cost” paragraph in the methods section with the exact protocol. revision: yes
Referee: [Phase (3)] Phase (3) stress-testing: the evaluation on ToxiGen is presented as a direct observation with no ablation studies, per-prompt breakdowns, or controls for prompt difficulty, which are required to substantiate the robustness-gap conclusion.

Authors: The ToxiGen results were intended as a targeted stress test rather than a full comparative study. To meet the referee’s request we will add: (i) an ablation comparing DExperts against the unsteered GPT-2 baseline on the same ToxiGen prompts, (ii) a per-prompt error breakdown highlighting the specific implicit-hate patterns that trigger the 1.5% failures, and (iii) a control experiment using difficulty-matched subsets drawn from RealToxicityPrompts. These analyses will appear in a new subsection of Phase (3) and will be summarized in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements only

full rationale

This is an empirical replication study that reports observed toxicity rates (100% on RealToxicityPrompts, 98.5% on ToxiGen) and latency values (0.2s to 2.0s) from running fixed benchmarks and an existing mitigation method. The provided text contains no equations, parameter-fitting steps, derivations, or self-referential definitions. All claims are presented as direct experimental outcomes rather than quantities constructed from the method's own inputs. No load-bearing step reduces to a fit or self-citation by construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on standard toxicity benchmarks and the original DExperts method without introducing new free parameters, axioms beyond routine ML evaluation assumptions, or invented entities.

axioms (1)

domain assumption Toxicity metrics from RealToxicityPrompts and ToxiGen are reliable proxies for real-world harm
Invoked when interpreting safety rates as evidence of mitigation success

pith-pipeline@v0.9.0 · 5583 in / 1187 out tokens · 52161 ms · 2026-05-15T05:07:37.073599+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1]

Brown et al

Tom B. Brown et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[2]

Language models are unsupervised multitask learners.OpenAI Blog, 2019

Alec Radford et al. Language models are unsupervised multitask learners.OpenAI Blog, 2019

work page 2019
[3]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. InFindings of EMNLP, pages 3356–3369, 2020

work page 2020
[4]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of FAccT, pages 610–623, 2021

work page 2021
[5]

The woman worked as a babysitter: On biases in language generation

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. InProceedings of EMNLP, pages 3407–3412, 2019

work page 2019
[6]

Smith, and Yejin Choi

Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavat- ula, Noah A. Smith, and Yejin Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. InProceedings of ACL-IJCNLP, pages 6691–6706, 2021

work page 2021
[7]

Challenges in detoxifying language models

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. InFindings of EMNLP, pages 2447–2469, 2021. Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

work page 2021
[8]

Training language models to follow instructions with human feedback

Long Ouyang et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[9]

Plug and play language models: A simple approach to controlled text generation

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. InProceedings of ICLR, 2020

work page 2020
[10]

FUDGE: Controlled text generation with future discriminators

Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. InProceedings of NAACL, pages 3511–3535, 2021

work page 2021
[11]

ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. InProceedings of ACL, pages 3309–3326, 2022

work page 2022
[12]

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. The risk of racial bias in hate speech detection. InProceedings of ACL, pages 1668–1678, 2019

work page 2019
[13]

Perspective API documentation

Perspective API. Perspective API documentation. https://www.perspectiveapi. com/, 2023

work page 2023
[14]

InProceedings of ACL, 2024

Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apos- toloff, Luca Zappella, and Pau Rodríguez Whispering Experts: Neural Inter- ventions for Toxicity Mitigation in Language Models. InProceedings of ACL, 2024

work page 2024
[15]

InFindings of EMNLP, 2023

Heegyu Kim and Hyunsouk Cho GTA: Gated Toxicity Avoidance for LM Perfor- mance Preservation. InFindings of EMNLP, 2023

work page 2023
[16]

InProceedings of EMNLP, 2023

Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang Unveiling the Implicit Toxicity in Large Language Models. InProceedings of EMNLP, 2023

work page 2023
[17]

InFindings of EMNLP, 2023

Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee, and Punyajoy Saha Probing LLMs for hate speech detection: strengths and vulnerabilities. InFindings of EMNLP, 2023

work page 2023
[18]

Jingjie Zeng, Liang Yang, Zekun Wang, Yuanyuan Sun, and Hongfei Lin Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? In Proceedings of ACL, 2025

work page 2025
[19]

Gallegos, Ryan A

Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K. Ahmed Bias and Fairness in Large Language Models: A Survey.Computational Linguistics, volume 50, pages 1097–1179, 2024

work page 2024
[20]

InFindings of ACL, 2024

Tinh Son Luong, Thanh-Thien Le, Linh Ngo Van, and Thien Huu Nguyen Realistic Evaluation of Toxicity in Large Language Models. InFindings of ACL, 2024

work page 2024
[21]

Detoxifying large language models via knowledge editing

Mengru Wang et al. Detoxifying Large Language Models via Knowledge Editing. arXiv preprint arXiv:2403.14472, 2024

work page arXiv 2024
[22]

arXiv preprint arXiv:2405.09373, 2024

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models. arXiv preprint arXiv:2405.09373, 2024

work page arXiv 2024
[23]

arXiv preprint arXiv:2408.12599, 2024

Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong, and Zhiyu Li Controllable Text Gen- eration for Large Language Models: A Survey. arXiv preprint arXiv:2408.12599, 2024

work page arXiv 2024
[24]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

InFindings of ACL, 2024

Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models. InFindings of ACL, 2024

work page 2024
[26]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. arXiv preprint arXiv:2404.01318, 2024

work page internal anchor Pith review arXiv 2024
[27]

arXiv preprint arXiv:2501.00066, 2024

Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, and Mykola Pechenizkiy On Adversarial Robustness of Language Models in Transfer Learning. arXiv preprint arXiv:2501.00066, 2024

work page arXiv 2024