pith. machine review for the scientific record.

arxiv: 2605.10639 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links


Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:23 UTC · model grok-4.3

classification 💻 cs.AI
keywords toxicity benchmarks · LLM evaluation · benchmark bias · task type effects · data domain shifts · model instability · safety assessment

The pith

Toxicity benchmarks flag more content as harmful when the task shifts from text completion to summarization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how stable common toxicity benchmarks are when used to evaluate large language models. It shows that the same input data produces higher rates of harmful flags once the task changes from completing text to summarizing it. The work also finds that benchmark outputs become inconsistent when the source domain of the data is altered and that different models trigger unstable safety scores on the same benchmarks. These results matter because many groups now use the benchmarks to certify models for customer applications and moderation tools, so hidden inconsistencies could allow unsafe models to pass or block safe ones.

Core claim

The paper claims that established toxicity benchmarks are biased and non-robust. The evidence comes from experiments in which shifting the task from text completion to summarization increases the tendency to flag content as harmful, certain benchmarks lose consistent behavior across changes in input data domain, and model-specific instabilities appear; together these findings indicate the need for more robust safety evaluation frameworks.

What carries the argument

Side-by-side experiments that hold input data fixed while varying task type (completion versus summarization), data domain, and model choice to measure changes in benchmark toxicity scores.
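In outline, that design amounts to scoring identical passages under two task framings and comparing flag rates. The sketch below is a hedged illustration only: the prompt templates, the generator, and the toxicity scorer are placeholders, not the paper's setup.

```python
# Minimal sketch of the paired design: identical inputs, two task framings.
# Template wording, the generator, and the toxicity scorer are placeholders.
from typing import Callable

COMPLETION_TEMPLATE = "Continue the following text:\n{passage}"
SUMMARIZATION_TEMPLATE = "Summarize the following text:\n{passage}"


def flag_rates(passages: list[str],
               generate: Callable[[str], str],
               is_toxic: Callable[[str], bool]) -> dict[str, float]:
    """Fraction of model outputs flagged as toxic under each task framing."""
    rates = {}
    for task, template in (("completion", COMPLETION_TEMPLATE),
                           ("summarization", SUMMARIZATION_TEMPLATE)):
        outputs = [generate(template.format(passage=p)) for p in passages]
        rates[task] = sum(is_toxic(o) for o in outputs) / len(outputs)
    return rates
```

Because the same passages feed both conditions, any difference in the two rates is attributable to the task framing rather than to the input data.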

Load-bearing premise

The observed changes in benchmark scores come from intrinsic biases inside the benchmarks themselves rather than from the specific models, prompts, or datasets selected for the tests.

What would settle it

If toxicity scores stayed the same when the task switched from completion to summarization and remained consistent across different data domains even after repeating the tests with varied models and prompts, the claim of intrinsic benchmark biases would be refuted.

Figures

Figures reproduced from arXiv: 2605.10639 by Andreas Stöckl, Martin Flechl, Regina Gugg, Selina Niederländer.

Figure 1
Figure 1: Per-model McNemar ORp for baseline versus summarization tasks across benchmarks. The dashed line marks the null value (ORp = 1); lower (higher) values indicate increased (decreased) harmful classifications for summarization. Switching the evaluation task to summarization consistently increases the rate of harmful classifications across most benchmarks, highlighting a significant blind spot in standard dire… view at source ↗
Figure 2
Figure 2: Population-level odds ratios (ORpop) derived from the GLM for each domain relative to the baseline. The results reveal that benchmarks differ considerably in how they respond to domain changes. view at source ↗
Figure 3
Figure 3: Cramér's V for per-model chi-squared tests of domain independence across benchmarks. Darker cells indicate stronger domain effects. In summary, the prevailing trend under domain shifting is a decrease in detected harmfulness, as most significant ORpop values lie above 1. While this effect is strongest for RTP, it appears across multiple benchmarks. At the same time, the model-level analysis highlights tha… view at source ↗
Figure 4
Figure 4: Mean evaluation scores on a 1-5 Likert scale across all five quality criteria per benchmark and target domain, as assessed by Claude 3 Haiku. view at source ↗
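The statistics named in the captions above can be illustrated with standard tools. The sketch below is a hedged reconstruction, not the authors' code: it assumes binary per-prompt toxicity flags, takes the McNemar odds ratio as the discordant-pair ratio under the Figure 1 convention (values below 1 mean more flags under summarization), fits a logit GLM for the population-level odds ratios of Figure 2, and computes Cramér's V from the usual chi-squared formula for Figure 3. The paper's exact estimators and model specification may differ.

```python
# Hedged sketch of the figure statistics; names, conventions, and the model
# specification are assumptions, not the authors' implementation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency
from statsmodels.stats.contingency_tables import mcnemar


def mcnemar_odds_ratio(baseline_flags, summarization_flags):
    """Discordant-pair odds ratio and exact McNemar p-value for paired flags."""
    base = np.asarray(baseline_flags, dtype=bool)
    summ = np.asarray(summarization_flags, dtype=bool)
    b = int(np.sum(base & ~summ))   # flagged only under the baseline task
    c = int(np.sum(~base & summ))   # flagged only under summarization
    table = [[int(np.sum(base & summ)), b],
             [c, int(np.sum(~base & ~summ))]]
    or_p = b / c if c else float("inf")  # < 1: more flags under summarization
    return or_p, mcnemar(table, exact=True).pvalue


def domain_odds_ratios(df: pd.DataFrame):
    """Population-level odds ratios per domain from a logit GLM (cf. Fig. 2).

    Expects one row per (prompt, domain) with a binary `flagged` column and a
    `domain` column whose reference level is named 'baseline'.
    """
    fit = smf.logit("flagged ~ C(domain, Treatment(reference='baseline'))",
                    data=df).fit(disp=False)
    return np.exp(fit.params)  # exponentiated coefficients approximate ORpop


def cramers_v(contingency):
    """Cramér's V for a domain-by-flag contingency table (cf. Fig. 3)."""
    table = np.asarray(contingency, dtype=float)
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))
```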
read the original abstract

The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper audits toxicity benchmarks for LLMs through experiments that vary the evaluation task (text completion vs. summarization), input data domain, and choice of model. It reports that summarization increases the rate at which content is flagged as toxic, that certain benchmarks exhibit inconsistent behavior under domain shifts, and that results show model-specific instabilities, concluding that current benchmarks require more robust safety evaluation frameworks.

Significance. If the discrepancies can be isolated to intrinsic benchmark properties, the work would usefully document practical limitations in toxicity evaluation setups that are already used for model certification and moderation. The empirical focus on task and domain sensitivity adds concrete observations to the growing literature on LLM safety benchmark fragility, though its impact hinges on the rigor of the controls.

major comments (3)
  1. [Methods / Experimental Setup] The central attribution of increased toxicity flagging to the shift from completion to summarization (and to domain changes) requires that prompt templates, instruction phrasing, and input statistics (length, base toxicity prevalence) be held fixed or explicitly ablated across conditions. No such controls or matching statistics are described, leaving open the possibility that observed shifts arise from prompt construction or data distribution differences rather than benchmark-intrinsic bias.
  2. [Results] Results sections reporting model-specific instabilities: Without reported statistical tests, confidence intervals, or ablation over multiple prompt phrasings and random seeds, it is unclear whether the instabilities reflect genuine benchmark-model interactions or sensitivity to unstated implementation choices. This directly affects the claim that the results demonstrate a 'clear need' for new frameworks.
  3. [Benchmark Selection] Benchmark selection and coverage: The paper examines 'established' toxicity benchmarks but does not justify the specific set chosen or demonstrate that the observed inconsistencies generalize beyond the selected models and datasets. This limits the load-bearing strength of the call for 'more robust and comprehensive' frameworks.
minor comments (2)
  1. [Abstract / Introduction] Abstract and introduction use the term 'intrinsic biases' without an operational definition; a short paragraph clarifying what would count as benchmark-intrinsic versus setup-dependent would improve clarity.
  2. [Figures / Tables] Figure captions and table legends should explicitly state the number of runs, prompt variants, and exact toxicity threshold used for each reported percentage.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments have prompted us to strengthen the methodological details, add statistical analyses, and clarify the scope of our claims. We address each major point below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Methods / Experimental Setup] The central attribution of increased toxicity flagging to the shift from completion to summarization (and to domain changes) requires that prompt templates, instruction phrasing, and input statistics (length, base toxicity prevalence) be held fixed or explicitly ablated across conditions. No such controls or matching statistics are described, leaving open the possibility that observed shifts arise from prompt construction or data distribution differences rather than benchmark-intrinsic bias.

    Authors: We appreciate this observation on experimental controls. In the original setup we used fixed prompt templates and instruction phrasing across task conditions to isolate the effect of task type, but we did not report input-length distributions or base toxicity prevalence for each domain. In the revision we have added a dedicated subsection in Methods that tabulates these statistics for all conditions and includes an ablation over three prompt phrasings. The additional results show that the increase in flagged toxicity under summarization persists after matching on length and prevalence, supporting our attribution to task-intrinsic factors. revision: yes

  2. Referee: [Results] Results sections reporting model-specific instabilities: Without reported statistical tests, confidence intervals, or ablation over multiple prompt phrasings and random seeds, it is unclear whether the instabilities reflect genuine benchmark-model interactions or sensitivity to unstated implementation choices. This directly affects the claim that the results demonstrate a 'clear need' for new frameworks.

    Authors: We agree that the absence of statistical quantification limited the strength of the instability claims. The revised Results section now reports 95% bootstrap confidence intervals and paired t-tests for all reported differences (a generic sketch of such a bootstrap appears after these responses). We further ran each configuration with five random seeds and two additional prompt variants; the model-specific patterns remain statistically significant and consistent across these controls. These additions are presented in new tables and figures, which we believe now substantiate the call for more robust frameworks. revision: yes

  3. Referee: [Benchmark Selection] Benchmark selection and coverage: The paper examines 'established' toxicity benchmarks but does not justify the specific set chosen or demonstrate that the observed inconsistencies generalize beyond the selected models and datasets. This limits the load-bearing strength of the call for 'more robust and comprehensive' frameworks.

    Authors: The selected benchmarks (RealToxicityPrompts, ToxiGen, and the Perspective API-based suite) were chosen because they are the most widely adopted in both research and production safety pipelines; we have now added an explicit justification paragraph in Section 2 citing their usage frequency in recent model releases. We acknowledge that our experiments cover only three models and these particular datasets. The revised Discussion section therefore frames the findings as illustrative of evaluation fragility rather than exhaustive proof, and we have tempered the language around the need for new frameworks to reflect this scope while still highlighting the practical implications for current certification practices. revision: partial
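Response 2 cites 95% bootstrap confidence intervals for the reported differences. A generic percentile bootstrap over prompts, sketched below under assumed conventions (not the authors' exact procedure), shows the general form of such an interval.

```python
# Percentile bootstrap for the paired difference in flagged-toxicity rates.
# Resampling is over prompts; the interval excludes 0 when the task effect is
# distinguishable from resampling noise under this (assumed) setup.
import numpy as np


def bootstrap_rate_difference_ci(baseline_flags, summarization_flags,
                                 n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for mean(summarization flags) - mean(baseline flags)."""
    base = np.asarray(baseline_flags, dtype=float)
    summ = np.asarray(summarization_flags, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(base)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample prompt indices with replacement
        diffs[i] = summ[idx].mean() - base[idx].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```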

Circularity Check

0 steps flagged

Empirical audit of benchmarks with no derived predictions or self-referential definitions

full rationale

The paper performs direct experimental measurements across task types (completion vs. summarization) and data domains using existing toxicity benchmarks and multiple LLMs. No equations, fitted parameters, or first-principles derivations appear; all claims rest on observed discrepancies in flagged toxicity rates. No self-citations are load-bearing for any central result, and no step reduces a reported outcome to an input by construction. The work is therefore self-contained as an empirical audit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper performs an empirical investigation and does not introduce free parameters, new axioms, or invented entities; it relies on standard assumptions about benchmark validity that are tested rather than postulated.

pith-pipeline@v0.9.0 · 5449 in / 1029 out tokens · 29470 ms · 2026-05-12T05:23:29.626356+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 12 internal anchors

  1. [1]

    HateBERT: Retraining BERT for Abusive Language Detection in English

    Caselli, T., Basile, V., Mitrović, J., Granitzer, M.: HateBERT: Retraining BERT for abusive language detection in English. arXiv:2010.12472 (2020), https://arxiv.org/abs/2010.12472

  2. [2]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., et al.: JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv:2404.01318 (2024), https://arxiv.org/abs/2404.01318

  3. [3]

    BeHonest: Benchmarking Honesty in Large Language Models

    Chern, S., Hu, Z., Yang, Y., Chern, E., Guo, Y., Jin, J., Wang, B., Liu, P.: BeHonest: Benchmarking honesty in large language models. arXiv:2406.13261 (2024), https://arxiv.org/abs/2406.13261

  4. [4]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Chiang, W.L., Zheng, L., Sheng, Y., Angelopoulos, A.N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J.E., et al.: Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv:2403.04132 (2024), https://arxiv.org/abs/2403.04132

  5. [5]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI: DeepSeek LLM: Scaling open-source language models with longtermism. arXiv:2401.02954 (2024), https://arxiv.org/abs/2401.02954

  6. [6]

    BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation

    Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.W., Gupta, R.: BOLD: Dataset and metrics for measuring biases in open-ended language generation. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. pp. 862–872 (2021)

  7. [7]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A.: RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv:2009.11462 (2020), https://arxiv.org/abs/2009.11462

  8. [8]

    Is Content Moderation for LLMs Task-Biased? A Statistical Meta Study on the Impact of LLM Tasks and Their Domains on Alignment Benchmarks

    Gugg, R.: Is content moderation for LLMs task-biased? A statistical meta study on the impact of LLM tasks and their domains on alignment benchmarks. Master's thesis, University of Applied Sciences Upper Austria Hagenberg (2025), https://permalink.obvsg.at/fho/AC176875873

  9. [9]

    ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

    Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., Kamar, E.: ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv:2203.09509 (2022), https://arxiv.org/abs/2203.09509

  10. [10]

    An empirical study of metrics to measure representational harms in pre-trained language models

    Hosseini, S., Palangi, H., Awadallah, A.H.: An empirical study of metrics to measure representational harms in pre-trained language models. arXiv:2301.09211 (2023), https://arxiv.org/abs/2301.09211

  11. [11]

    Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation

    Huang, Y., Gupta, S., Xia, M., Li, K., Chen, D.: Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv:2310.06987 (2023), https://arxiv.org/abs/2310.06987

  12. [12]

    Mistral 7B

    Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7B. arXiv:2310.06825 (2023), https://arxiv.org/abs/2310.06825

  13. [13]

    Dynabench: Rethinking Benchmarking in NLP

    Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., Williams, A.: Dynabench: Rethinking benchmarking in NLP. arXiv:2104.14337 (2021), https://arxiv.org/abs/2104.14337

  14. [14]

    SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

    Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., Shao, J.: SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv:2402.05044 (2024), https://arxiv.org/abs/2402.05044

  15. [15]

    Holistic Evaluation of Language Models

    Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic evaluation of language models. arXiv:2211.09110 (2022), https://arxiv.org/abs/2211.09110

  16. [16]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Lin, S., Hilton, J., Evans, O.: TruthfulQA: Measuring how models mimic human falsehoods. arXiv:2109.07958 (2021), https://arxiv.org/abs/2109.07958

  17. [17]

    ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation

    Lin, Z., Wang, Z., Tong, Y., Wang, Y., Guo, Y., Wang, Y., Shang, J.: ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv:2310.17389 (2023), https://arxiv.org/abs/2310.17389

  18. [18]

    Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models’ Alignment

    Liu, Y., Yao, Y., Ton, J.F., Zhang, X., Guo, R., Cheng, H., Klochkov, Y., Taufiq, M.F., Li, H.: Trustworthy LLMs: A survey and guideline for evaluating large language models’ alignment. arXiv:2308.05374 (2023), https://arxiv.org/abs/2308.05374

  19. [19]

    HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection

    Mathew, B., Saha, P., Yimam, S.M., Biemann, C., Goyal, P., Mukherjee, A.: HateXplain: A benchmark dataset for explainable hate speech detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 14867–14875 (2021)

  20. [20]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249 (2024), https://arxiv.org/abs/2402.04249

  21. [21]

    StereoSet: Measuring Stereotypical Bias in Pretrained Language Models

    Nadeem, M., Bethke, A., Reddy, S.: StereoSet: Measuring stereotypical bias in pretrained language models. arXiv:2004.09456 (2020), https://arxiv.org/abs/2004.09456

  22. [22]

    CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

    Nangia, N., Vania, C., Bhalerao, R., Bowman, S.R.: CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. arXiv:2010.00133 (2020), https://arxiv.org/abs/2010.00133

  23. [23]

    BBQ: A Hand-Built Bias Benchmark for Question Answering

    Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P.M., Bowman, S.R.: BBQ: A hand-built bias benchmark for question answering. arXiv:2110.08193 (2021), https://arxiv.org/abs/2110.08193

  24. [24]

    XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

    Röttger, P., Kirk, H.R., Vidgen, B., Attanasio, G., Bianchi, F., Hovy, D.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv:2308.01263 (2023), https://arxiv.org/abs/2308.01263

  25. [25]

    On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning

    Shaikh, O., Zhang, H., Held, W., Bernstein, M., Yang, D.: On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning. arXiv:2212.08061 (2022), https://arxiv.org/abs/2212.08061

  26. [26]

    A StrongREJECT for Empty Jailbreaks

    Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., et al.: A StrongREJECT for empty jailbreaks. arXiv:2402.10260 (2024), https://arxiv.org/abs/2402.10260

  27. [27]

    Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

    Thakur, A.S., Choudhary, K., Ramayapally, V.S., Vaidyanathan, S., Hupkes, D.: Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. arXiv:2406.12624 (2025), https://arxiv.org/abs/2406.12624

  28. [28]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288 (2023), https://arxiv.org/abs/2307.09288

  29. [29]

    SimpleSafetyTests: A Test Suite for Identifying Critical Safety Risks in Large Language Models

    Vidgen, B., Scherrer, N., Kirk, H.R., Qian, R., Kannappan, A., Hale, S.A., Röttger, P.: SimpleSafetyTests: A test suite for identifying critical safety risks in large language models. arXiv:2311.08370 (2023), https://arxiv.org/abs/2311.08370

  30. [30]

    Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

    Wang, Y., Li, H., Han, X., Nakov, P., Baldwin, T.: Do-Not-Answer: A dataset for evaluating safeguards in LLMs. arXiv:2308.13387 (2023), https://arxiv.org/abs/2308.13387

  31. [31]

    Ethical and social risks of harm from Language Models

    Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al.: Ethical and social risks of harm from language models. arXiv:2112.04359 (2021), https://arxiv.org/abs/2112.04359

  32. [32]

    SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

    Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., et al.: SORRY-Bench: Systematically evaluating large language model safety refusal. International Conference on Learning Representations (ICLR) (2025)

  33. [33]

    Qwen2 Technical Report

    Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al.: Qwen2 technical report. arXiv:2407.10671 (2024), https://arxiv.org/abs/2407.10671

  34. [34]

    SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions

    Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., Huang, M.: SafetyBench: Evaluating the safety of large language models. arXiv:2309.07045 (2023), https://arxiv.org/abs/2309.07045

  35. [35]

    Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

    Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.W.: Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv:1804.06876 (2018), https://arxiv.org/abs/1804.06876

  36. [36]

    Challenges in Automated Debiasing for Toxic Language Detection

    Zhou, X., Sap, M., Swayamdipta, S., Choi, Y., Smith, N.A.: Challenges in automated debiasing for toxic language detection. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pp. 3143–3155 (2021)

  37. [37]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043 (2023), https://arxiv.org/abs/2307.15043