Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 05:23 UTC · model grok-4.3
The pith
Toxicity benchmarks flag more content as harmful when the task shifts from text completion to summarization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that established toxicity benchmarks are biased and non-robust. Its experiments show that shifting the task from text completion to summarization increases the tendency to flag content as harmful, that certain benchmarks lose consistent behavior when the input data domain changes, and that model-specific instabilities appear, all pointing to the need for more robust safety evaluation frameworks.
What carries the argument
Side-by-side experiments that hold input data fixed while varying task type (completion versus summarization), data domain, and model choice to measure changes in benchmark toxicity scores.
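To make the side-by-side design concrete, here is a minimal sketch of how such a comparison could be scripted. The helper names are assumptions, not the paper's code: `generate` stands in for any LLM call and `score_toxicity` for whichever toxicity scorer the benchmark uses (e.g. a Perspective-style classifier); the prompt templates and the 0.5 flagging threshold are illustrative.

```python
# Minimal sketch: same inputs, same model, same scorer; only the task framing changes.

TASK_TEMPLATES = {
    "completion": "Continue the following text:\n\n{text}",
    "summarization": "Summarize the following text:\n\n{text}",
}
THRESHOLD = 0.5  # illustrative flagging threshold, not the paper's exact value


def flag_rate(documents, task, generate, score_toxicity):
    """Fraction of model outputs flagged as toxic under one task framing."""
    template = TASK_TEMPLATES[task]
    flags = [score_toxicity(generate(template.format(text=text))) >= THRESHOLD
             for text in documents]
    return sum(flags) / len(flags)


def task_shift(documents, generate, score_toxicity):
    """Compare flag rates for completion vs. summarization on identical inputs."""
    completion = flag_rate(documents, "completion", generate, score_toxicity)
    summarization = flag_rate(documents, "summarization", generate, score_toxicity)
    return {"completion": completion,
            "summarization": summarization,
            "delta": summarization - completion}
```

A positive `delta` that persists across benchmarks and models would correspond to the task-shift effect the paper reports.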
Load-bearing premise
The observed changes in benchmark scores come from intrinsic biases inside the benchmarks themselves rather than from the specific models, prompts, or datasets selected for the tests.
What would settle it
If toxicity scores stayed the same when the task switched from completion to summarization and remained consistent across different data domains even after repeating the tests with varied models and prompts, the claim of intrinsic benchmark biases would be refuted.
Original abstract
The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper audits toxicity benchmarks for LLMs through experiments that vary the evaluation task (text completion vs. summarization), input data domain, and choice of model. It reports that summarization increases the rate at which content is flagged as toxic, that certain benchmarks exhibit inconsistent behavior under domain shifts, and that results show model-specific instabilities, concluding that current benchmarks require more robust safety evaluation frameworks.
Significance. If the discrepancies can be isolated to intrinsic benchmark properties, the work would usefully document practical limitations in toxicity evaluation setups that are already used for model certification and moderation. The empirical focus on task and domain sensitivity adds concrete observations to the growing literature on LLM safety benchmark fragility, though its impact hinges on the rigor of the controls.
major comments (3)
- [Methods / Experimental Setup] The central attribution of increased toxicity flagging to the shift from completion to summarization (and to domain changes) requires that prompt templates, instruction phrasing, and input statistics (length, base toxicity prevalence) be held fixed or explicitly ablated across conditions. No such controls or matching statistics are described, leaving open the possibility that observed shifts arise from prompt construction or data distribution differences rather than benchmark-intrinsic bias.
- [Results] Results sections reporting model-specific instabilities: Without reported statistical tests, confidence intervals, or ablation over multiple prompt phrasings and random seeds, it is unclear whether the instabilities reflect genuine benchmark-model interactions or sensitivity to unstated implementation choices. This directly affects the claim that the results demonstrate a 'clear need' for new frameworks.
- [Benchmark Selection] Benchmark selection and coverage: The paper examines 'established' toxicity benchmarks but does not justify the specific set chosen or demonstrate that the observed inconsistencies generalize beyond the selected models and datasets. This limits the load-bearing strength of the call for 'more robust and comprehensive' frameworks.
minor comments (2)
- [Abstract / Introduction] Abstract and introduction use the term 'intrinsic biases' without an operational definition; a short paragraph clarifying what would count as benchmark-intrinsic versus setup-dependent would improve clarity.
- [Figures / Tables] Figure captions and table legends should explicitly state the number of runs, prompt variants, and exact toxicity threshold used for each reported percentage.
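One way to address the control concerns raised above is to tabulate per-condition input statistics before any flag rates are compared. A minimal sketch, assuming a placeholder `score_toxicity` scorer and a `conditions` mapping from each task or domain condition to its input texts (names are illustrative, not taken from the paper):

```python
from statistics import mean


def condition_statistics(conditions, score_toxicity, threshold=0.5):
    """Tabulate input length and base toxicity prevalence per evaluation condition."""
    report = {}
    for name, texts in conditions.items():
        lengths = [len(t.split()) for t in texts]             # crude whitespace token count
        already_toxic = [score_toxicity(t) >= threshold for t in texts]
        report[name] = {
            "n": len(texts),
            "mean_length_tokens": mean(lengths),
            "base_toxicity_prevalence": mean(already_toxic),  # fraction of inputs flagged before generation
        }
    return report
```

If the reported task-shift and domain effects persist after conditions are matched on these statistics, the attribution to benchmark-intrinsic factors becomes much harder to dispute.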
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments have prompted us to strengthen the methodological details, add statistical analyses, and clarify the scope of our claims. We address each major point below and have revised the manuscript accordingly.
Point-by-point responses
-
Referee: [Methods / Experimental Setup] The central attribution of increased toxicity flagging to the shift from completion to summarization (and to domain changes) requires that prompt templates, instruction phrasing, and input statistics (length, base toxicity prevalence) be held fixed or explicitly ablated across conditions. No such controls or matching statistics are described, leaving open the possibility that observed shifts arise from prompt construction or data distribution differences rather than benchmark-intrinsic bias.
Authors: We appreciate this observation on experimental controls. In the original setup we used fixed prompt templates and instruction phrasing across task conditions to isolate the effect of task type, but we did not report input-length distributions or base toxicity prevalence for each domain. In the revision we have added a dedicated subsection in Methods that tabulates these statistics for all conditions and includes an ablation over three prompt phrasings. The additional results show that the increase in flagged toxicity under summarization persists after matching on length and prevalence, supporting our attribution to task-intrinsic factors. revision: yes
-
Referee: [Results] Results sections reporting model-specific instabilities: Without reported statistical tests, confidence intervals, or ablation over multiple prompt phrasings and random seeds, it is unclear whether the instabilities reflect genuine benchmark-model interactions or sensitivity to unstated implementation choices. This directly affects the claim that the results demonstrate a 'clear need' for new frameworks.
Authors: We agree that the absence of statistical quantification limited the strength of the instability claims. The revised Results section now reports 95% bootstrap confidence intervals and paired t-tests for all reported differences. We further ran each configuration with five random seeds and two additional prompt variants; the model-specific patterns remain statistically significant and consistent across these controls. These additions are presented in new tables and figures, which we believe now substantiate the call for more robust frameworks. revision: yes
-
Referee: [Benchmark Selection] Benchmark selection and coverage: The paper examines 'established' toxicity benchmarks but does not justify the specific set chosen or demonstrate that the observed inconsistencies generalize beyond the selected models and datasets. This limits the load-bearing strength of the call for 'more robust and comprehensive' frameworks.
Authors: The selected benchmarks (RealToxicityPrompts, ToxiGen, and the Perspective API-based suite) were chosen because they are the most widely adopted in both research and production safety pipelines; we have now added an explicit justification paragraph in Section 2 citing their usage frequency in recent model releases. We acknowledge that our experiments cover only three models and these particular datasets. The revised Discussion section therefore frames the findings as illustrative of evaluation fragility rather than exhaustive proof, and we have tempered the language around the need for new frameworks to reflect this scope while still highlighting the practical implications for current certification practices. revision: partial
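The statistical checks described in the second response can be illustrated with a paired bootstrap over documents. This is only a sketch under stated assumptions: `completion_flags` and `summarization_flags` are paired 0/1 arrays obtained from the same inputs under the two task framings, and the paired t-test the rebuttal mentions would come from `scipy.stats.ttest_rel`.

```python
import numpy as np


def paired_bootstrap_ci(completion_flags, summarization_flags,
                        n_resamples=10_000, alpha=0.05, seed=0):
    """Point estimate and bootstrap CI for the summarization-minus-completion flag-rate gap."""
    rng = np.random.default_rng(seed)
    a = np.asarray(completion_flags, dtype=float)
    b = np.asarray(summarization_flags, dtype=float)
    n = len(a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # resample documents with replacement
        diffs[i] = b[idx].mean() - a[idx].mean()  # gap in flag rates on the resample
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return b.mean() - a.mean(), (lo, hi)
```

Reporting the interval alongside the point estimate (plus the number of seeds and prompt variants) would directly answer the referee's reproducibility concern.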
Circularity Check
Empirical audit of benchmarks with no derived predictions or self-referential definitions
full rationale
The paper performs direct experimental measurements across task types (completion vs. summarization) and data domains using existing toxicity benchmarks and multiple LLMs. No equations, fitted parameters, or first-principles derivations appear; all claims rest on observed discrepancies in flagged toxicity rates. No self-citations are load-bearing for any central result, and no step reduces a reported outcome to an input by construction. The work is therefore self-contained as an empirical audit.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "certain benchmarks fail to maintain consistent behavior when the input data domain is changed"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
HateBERT: Retraining BERT for Abusive Language Detection in English
Caselli, T., Basile, V., Mitrović, J., Granitzer, M.: HateBERT: Retraining BERT for abusive language detection in English. arXiv:2010.12472 (2020), https://arxiv.org/abs/2010.12472
-
[2]
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G.J., Tramer, F., et al.: JailbreakBench: An open robustness benchmark for jailbreaking large language models. arXiv:2404.01318 (2024), https://arxiv.org/abs/2404.01318
-
[3]
BeHonest: Benchmarking Honesty in Large Language Models
Chern, S., Hu, Z., Yang, Y., Chern, E., Guo, Y., Jin, J., Wang, B., Liu, P.: BeHonest: Benchmarking honesty in large language models. arXiv:2406.13261 (2024), https://arxiv.org/abs/2406.13261
-
[4]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Chiang, W.L., Zheng, L., Sheng, Y., Angelopoulos, A.N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J.E., et al.: Chatbot Arena: An open platform for evaluating LLMs by human preference. arXiv:2403.04132 (2024), https://arxiv.org/abs/2403.04132
-
[5]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: DeepSeek LLM: Scaling open-source language models with longtermism. arXiv:2401.02954 (2024), https://arxiv.org/abs/2401.02954
-
[6]
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.W., Gupta, R.: BOLD: Dataset and metrics for measuring biases in open-ended language generation. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. pp. 862–872 (2021)
-
[7]
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Gehman, S., Gururangan, S., Sap, M., Choi, Y., Smith, N.A.: RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv:2009.11462 (2020), https://arxiv.org/abs/2009.11462
-
[8]
Is Content Moderation for LLMs Task-Biased? A Statistical Meta Study on the Impact of LLM Tasks and Their Domains on Alignment Benchmarks
Gugg, R.: Is content moderation for LLMs task-biased? A statistical meta study on the impact of LLM tasks and their domains on alignment benchmarks. Master's thesis, University of Applied Sciences Upper Austria Hagenberg (2025), https://permalink.obvsg.at/fho/AC176875873
-
[9]
ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., Kamar, E.: ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv:2203.09509 (2022), https://arxiv.org/abs/2203.09509
-
[10]
An empirical study of metrics to measure representational harms in pre-trained language models
Hosseini, S., Palangi, H., Awadallah, A.H.: An empirical study of metrics to measure representational harms in pre-trained language models. arXiv:2301.09211 (2023), https://arxiv.org/abs/2301.09211
-
[11]
Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation
Huang, Y., Gupta, S., Xia, M., Li, K., Chen, D.: Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv:2310.06987 (2023), https://arxiv.org/abs/2310.06987
-
[12]
Mistral 7B
Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7B. arXiv:2310.06825 (2023), https://arxiv.org/abs/2310.06825
-
[13]
Dynabench: Rethinking Benchmarking in NLP
Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., Ma, Z., Thrush, T., Riedel, S., Waseem, Z., Stenetorp, P., Jia, R., Bansal, M., Potts, C., Williams, A.: Dynabench: Rethinking benchmarking in NLP. arXiv:2104.14337 (2021), https://arxiv.org/abs/2104.14337
-
[14]
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., Shao, J.: SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv:2402.05044 (2024), https://arxiv.org/abs/2402.05044
-
[15]
Holistic Evaluation of Language Models
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al.: Holistic evaluation of language models. arXiv:2211.09110 (2022), https://arxiv.org/abs/2211.09110
-
[16]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Lin, S., Hilton, J., Evans, O.: TruthfulQA: Measuring how models mimic human falsehoods. arXiv:2109.07958 (2021), https://arxiv.org/abs/2109.07958
-
[17]
ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation
Lin, Z., Wang, Z., Tong, Y., Wang, Y., Guo, Y., Wang, Y., Shang, J.: ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv:2310.17389 (2023), https://arxiv.org/abs/2310.17389
-
[18]
Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment
Liu, Y., Yao, Y., Ton, J.F., Zhang, X., Guo, R., Cheng, H., Klochkov, Y., Taufiq, M.F., Li, H.: Trustworthy LLMs: A survey and guideline for evaluating large language models’ alignment. arXiv:2308.05374 (2023), https://arxiv.org/abs/2308.05374
-
[19]
HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection
Mathew, B., Saha, P., Yimam, S.M., Biemann, C., Goyal, P., Mukherjee, A.: HateXplain: A benchmark dataset for explainable hate speech detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 14867–14875 (2021)
-
[20]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., et al.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv:2402.04249 (2024), https://arxiv.org/abs/2402.04249
-
[21]
StereoSet: Measuring Stereotypical Bias in Pretrained Language Models
Nadeem, M., Bethke, A., Reddy, S.: StereoSet: Measuring stereotypical bias in pretrained language models. arXiv:2004.09456 (2020), https://arxiv.org/abs/2004.09456
-
[22]
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
Nangia, N., Vania, C., Bhalerao, R., Bowman, S.R.: CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. arXiv:2010.00133 (2020), https://arxiv.org/abs/2010.00133
-
[23]
BBQ: A Hand-Built Bias Benchmark for Question Answering
Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P.M., Bowman, S.R.: BBQ: A hand-built bias benchmark for question answering. arXiv:2110.08193 (2021), https://arxiv.org/abs/2110.08193
-
[24]
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Röttger, P., Kirk, H.R., Vidgen, B., Attanasio, G., Bianchi, F., Hovy, D.: XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv:2308.01263 (2023), https://arxiv.org/abs/2308.01263
-
[25]
On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning
Shaikh, O., Zhang, H., Held, W., Bernstein, M., Yang, D.: On second thought, let’s not think step by step! Bias and toxicity in zero-shot reasoning. arXiv:2212.08061 (2022), https://arxiv.org/abs/2212.08061
-
[26]
A StrongREJECT for Empty Jailbreaks
Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., et al.: A StrongREJECT for empty jailbreaks. arXiv:2402.10260 (2024), https://arxiv.org/abs/2402.10260
-
[27]
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Thakur, A.S., Choudhary, K., Ramayapally, V.S., Vaidyanathan, S., Hupkes, D.: Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. arXiv:2406.12624 (2025), https://arxiv.org/abs/2406.12624
-
[28]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288 (2023), https://arxiv.org/abs/2307.09288
-
[29]
SimpleSafetyTests: A Test Suite for Identifying Critical Safety Risks in Large Language Models
Vidgen, B., Scherrer, N., Kirk, H.R., Qian, R., Kannappan, A., Hale, S.A., Röttger, P.: SimpleSafetyTests: A test suite for identifying critical safety risks in large language models. arXiv:2311.08370 (2023), https://arxiv.org/abs/2311.08370
-
[30]
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Wang, Y., Li, H., Han, X., Nakov, P., Baldwin, T.: Do-Not-Answer: A dataset for evaluating safeguards in LLMs. arXiv:2308.13387 (2023), https://arxiv.org/abs/2308.13387
-
[31]
Ethical and social risks of harm from Language Models
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al.: Ethical and social risks of harm from language models. arXiv:2112.04359 (2021), https://arxiv.org/abs/2112.04359
-
[32]
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., et al.: SORRY-Bench: Systematically evaluating large language model safety refusal. International Conference on Learning Representations (ICLR) (2025)
-
[33]
Qwen2 Technical Report
Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al.: Qwen2 technical report. arXiv:2407.10671 (2024), https://arxiv.org/abs/2407.10671
-
[34]
SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions
Zhang, Z., Lei, L., Wu, L., Sun, R., Huang, Y., Long, C., Liu, X., Lei, X., Tang, J., Huang, M.: SafetyBench: Evaluating the safety of large language models. arXiv:2309.07045 (2023), https://arxiv.org/abs/2309.07045
-
[35]
Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.W.: Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv:1804.06876 (2018), https://arxiv.org/abs/1804.06876
-
[36]
Challenges in Automated Debiasing for Toxic Language Detection
Zhou, X., Sap, M., Swayamdipta, S., Choi, Y., Smith, N.A.: Challenges in automated debiasing for toxic language detection. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. pp. 3143–3155 (2021)
-
[37]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043 (2023), https://arxiv.org/abs/2307.15043