pith. machine review for the scientific record.

arxiv: 2605.09041 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.CR

Recognition: no theorem link

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3

classification 💻 cs.CL cs.CR
keywords LLM bias · prompt sensitivity · response inconsistency · bias audit · selection and elaboration · variance decomposition

The pith

LLM bias shifts as much from prompt format changes as from model differences and often cancels between selection and elaboration layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current bias benchmarks reduce each model to one number from one prompt style and one aggregate label. This approach overlooks how small rephrasings of the same question move bias scores by large amounts and how a response can endorse bias in its direct answer while contradicting it in the surrounding text. The paper introduces a two-axis protocol that tests every statement across a grid of formats, perspectives, roles, and sentiments while separately coding the discrete choice and the free-text justification. Experiments across eight models and over eighty thousand responses show that task format alone accounts for variance comparable to model identity, most bias signals live in only one coding layer, and some prompt changes reduce apparent bias only by moving it between layers rather than lowering it overall.

Core claim

Bias in large language models is not a fixed property of the model weights but emerges from interactions between prompt construction and internal response structure. By evaluating each statement under a full factorial of task format, perspective, role, and sentiment, and by applying split coding to isolate selection from elaboration, the protocol produces separate scores for each layer together with an inconsistency measure. The resulting data show that prompt dimensions interact more strongly than their individual main effects, that selection-only and elaboration-only rankings across models are nearly uncorrelated, and that some prompt configurations reduce bias in both layers while others suppress only selection-layer bias, redistributing the signal rather than removing it.
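The layer scores and divergence metrics described here can be sketched directly from split-coded labels. A minimal illustration, assuming each response is coded as a binary (selection, elaboration) pair; the function and variable names are ours, and the IR/DNI definitions follow the figure captions (DNI = BER_sel − BER_elab), not the paper's code:

```python
from typing import List, Tuple

def layer_scores(coded: List[Tuple[int, int]]) -> dict:
    """Compute per-layer bias rates and divergence metrics from
    split-coded responses. Each item is (selection, elaboration),
    where 1 = endorses the biased statement, 0 = rejects it."""
    n = len(coded)
    ber_sel = sum(s for s, _ in coded) / n        # selection-layer bias rate
    ber_elab = sum(e for _, e in coded) / n       # elaboration-layer bias rate
    ir = sum(1 for s, e in coded if s != e) / n   # inconsistency rate: layers disagree
    dni = ber_sel - ber_elab                      # signed net imbalance between layers
    return {"BER_sel": ber_sel, "BER_elab": ber_elab, "IR": ir, "DNI": dni}

# A "cancellation trap": half the responses endorse in selection only,
# half in elaboration only — DNI is 0 while every response is inconsistent.
metrics = layer_scores([(1, 0), (0, 1)] * 50)
```

The toy input shows why the inconsistency rate must be reported alongside the net imbalance: a DNI of zero is compatible with total cross-layer disagreement.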

What carries the argument

BiAxisAudit protocol that evaluates bias on an across-prompt axis using a factorial grid of dimensions and on a within-response axis using split coding to isolate selection from elaboration, with metrics for inconsistency rate and net imbalance.
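The across-prompt axis is a full factorial over the four prompt dimensions. A minimal sketch of the grid construction; the dimension levels below are placeholders chosen to yield the 24 interaction cells mentioned in the figure captions, not the paper's actual level names:

```python
from itertools import product

# Illustrative dimension levels — the paper's actual values differ
# (it reports six task formats elsewhere; these are placeholders).
task_formats = ["binary_judgment", "choose_the_option", "sentence_completion"]
perspectives = ["self", "third_person"]
roles = ["neutral", "assigned_persona"]
sentiments = ["neutral", "skeptical"]

def prompt_grid(statement: str):
    """Yield one prompt spec per cell of the full factorial grid,
    so every statement is evaluated under every combination."""
    for task, persp, role, sent in product(task_formats, perspectives,
                                           roles, sentiments):
        yield {"statement": statement, "task": task,
               "perspective": persp, "role": role, "sentiment": sent}

cells = list(prompt_grid("example stereotype statement"))
# 3 * 2 * 2 * 2 = 24 cells per statement
```

Treating bias as a distribution then simply means scoring each cell separately instead of averaging them into one number.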

If this is right

  • Task format explains as much variance in measured bias as the choice of model itself.
  • 63.6 percent of bias signals appear in only one of the two response layers.
  • Prompt-dimension interactions exceed main effects in explaining bias variation.
  • Certain prompt configurations lower bias in both layers while others reduce only the selection layer by redistributing the signal.
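The near-zero correlation between selection-only and elaboration-only rankings (Spearman ρ = 0.238 in the abstract) can be checked with the classic rank-correlation formula. A self-contained sketch; the two eight-model rankings below are hypothetical, not the paper's data:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same
    items: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    assert sorted(rank_a) == sorted(rank_b) == list(range(1, n + 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical selection-only vs elaboration-only rankings of 8 models:
sel_rank  = [1, 2, 3, 4, 5, 6, 7, 8]
elab_rank = [3, 1, 6, 2, 8, 4, 7, 5]
rho = spearman_rho(sel_rank, elab_rank)
```

A low ρ on the same responses is exactly the signature of layer-specific bias: the two codings rank the models differently.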

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulatory audits should require multiple prompt formats and layer-specific reporting to avoid exploitable underestimates.
  • Developers could target training objectives separately at selection consistency and elaboration alignment.
  • The same two-axis logic might apply to other subjective judgments such as factuality or safety.

Load-bearing premise

Split coding can reliably separate selection from elaboration without coder bias or lost context, and the chosen prompt dimensions are sufficient to capture the main sources of sensitivity.

What would settle it

If selection-only and elaboration-only bias rankings across the same set of models turn out to be strongly correlated, or if bias scores stay nearly constant across different task formats, the added value of the two-axis measurement would be called into question.

Figures

Figures reproduced from arXiv: 2605.09041 by Jialing Gan, Junhao Dong, Songze Li.

Figure 1: Bias endorsement rate (BER) of eight models across …

Figure 2: Collapsing a model’s discrete choice and free-text …

Figure 3: Overview of the BiAxisAudit evaluation pipeline. Stimulus construction crosses stereotype statements with the prompt matrix and target LLMs; the two-layer audit independently codes selection and elaboration to compute the BER family and (IR, DNI), exposing hidden bias when the layers diverge. … of layer disagreement (IR) from its direction (DNI). No single scalar drawn from the BER family exposes both dimens…

Figure 4: Per-judge inter-rater agreement on the 8,016-item reliability sample. x-axis: κ_sel (selection layer; near ceiling due to the regex extractor). y-axis: κ_elab (elaboration layer; the primary reliability measure). All eight candidate judges fall within the Landis–Koch substantial-agreement band (κ_elab ≥ 0.61, shaded). The final vote-3 panel (gold rim) comprises Qwen Plus (κ_elab = 0.757, rank 1), Gemini 3 Flas…

Figure 5: Within-model BER swing under task perturbation. For each model, the bar shows [min, max] BER_union across the six task formats on the same 200 statements; each dot is one task. The narrowest swing is Claude Sonnet 4.6 (Δ = 0.192); the widest is GPT-OSS-20B (Δ = 0.783). Results are unchanged under the corresponding vendor-swap robustness check (…)

Figure 6: (IR, DNI) audit-reliability map across eight models. x-axis: IR (audit-error magnitude); y-axis: DNI (signed direction). Marker size encodes naive BER_sel. Background shading marks reliable (IR < 0.10), over-estimation (upper right), and under-estimation (lower right) regimes. Two models, GPT-5.4 (IR = 0.138, DNI = −0.017) and GPT-OSS-20B (IR = 0.107, DNI = −0.004), occupy the cancellation trap: DNI ≈ 0 not because …

Figure 7: Task × Sentiment: skeptical framing flips the sign of measured bias. Pooled BER_sel across 8 models under the 3 × 2 task–sentiment grid; error bars show ±1 SD across models. Skeptical framing reduces measured bias by 26 pp on CTO but increases it by 6 pp on SC, evidencing a non-additive (crossover) interaction; BJ is unaffected. The interaction term explains η² = 0.043, comparable to the main effect of task …

Figure 8: DNI across 24 interaction cells and 8 source models. Each cell shows DNI = BER_sel − BER_elab on its divergence-eligible base; red = over-estimation (DNI > 0), blue = under-estimation (DNI < 0), white = no directional bias. Gold triangles (▲) mark cells whose DNI flips sign relative to the baseline cell (BJ SELF NEUTRAL NEUTRAL). 16 of 24 cells flip sign across the 8 models (pooled); the CTO column is uniformly red …

Figure 9: Full BER-family prompt-shift results. Rows correspond to the four BER definitions used in our split-coding analysis: …
Original abstract

Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than $0.7$ on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a ``cancellation trap''). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman $\rho = 0.238$, $p = 0.570$): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce \textsc{BiAxisAudit}, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with $80{,}200$ coded responses each, task format alone explains as much variance as model choice; $63.6\%$ of pooled bias signals (up to $85.2\%$ per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BiAxisAudit, a framework for auditing bias in LLMs that accounts for prompt sensitivity via a factorial design over task format, perspective, role, and sentiment, and response-layer divergence via Split Coding to separate Selection and Elaboration. Based on experiments with eight LLMs and over 80,000 responses per model, it claims that task format explains variance comparable to model choice, that 63.6% of bias signals are layer-specific, that selection-only and elaboration-only bias rankings are weakly correlated (Spearman ρ = 0.238), and that the framework can distinguish genuine bias reductions from those due to cross-layer redistribution.

Significance. If the Split Coding procedure can be shown to be reliable and reproducible, this work would be a meaningful contribution to LLM bias evaluation by addressing two under-examined failure modes: prompt-format sensitivity and within-response cancellation effects. The scale of the empirical component (eight models, 80,200 coded responses each) and the introduction of orthogonal metrics (Inconsistency Rate, Divergence Net Imbalance) provide a concrete basis for moving beyond single-scalar benchmarks, which is timely given regulatory contexts such as the EU AI Act. The finding that prompt-dimension interactions exceed main effects and that task format rivals model identity in explanatory power would, if substantiated, have direct implications for benchmark design.

major comments (2)
  1. [Within-response axis description] The within-response axis relies on Split Coding to partition each response into a discrete Selection component and a free-text Elaboration component; all headline quantitative results (63.6% of pooled signals appearing in only one layer, up to 85.2% per model; Spearman ρ = 0.238 between selection-only and elaboration-only rankings; separation of real bias reductions from cross-layer redistribution) presuppose that this partition is clean and reproducible. No coding protocol, decision rules, inter-rater reliability statistics, or validation against full-context human judgment are supplied in the manuscript, rendering it impossible to determine whether the reported layer divergence reflects model behavior or annotation artifacts.
  2. [Results section] The claim that task format alone explains as much variance as model choice, together with the reported percentages of layer-specific signals, depends on a variance decomposition whose exact statistical procedure, handling of interactions, and robustness checks are not detailed. Without these, the cross-model and cross-prompt comparisons cannot be independently verified.
minor comments (2)
  1. [Abstract] The abstract reports a p-value of 0.570 for the Spearman correlation but does not state whether this is two-tailed or one-tailed, nor whether any correction for multiple comparisons was applied across the eight models.
  2. [Abstract] Notation for the bias metrics (BER, IR) and the exact definition of the 0.7 shift threshold should be introduced with a brief parenthetical in the abstract for readers who encounter the paper first in abstract form.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and reproducibility of the BiAxisAudit framework. We address each major comment point by point below.

Point-by-point responses
  1. Referee: The within-response axis relies on Split Coding to partition each response into a discrete Selection component and a free-text Elaboration component; all headline quantitative results (63.6% of pooled signals appearing in only one layer, up to 85.2% per model; Spearman ρ = 0.238 between selection-only and elaboration-only rankings; separation of real bias reductions from cross-layer redistribution) presuppose that this partition is clean and reproducible. No coding protocol, decision rules, inter-rater reliability statistics, or validation against full-context human judgment are supplied in the manuscript, rendering it impossible to determine whether the reported layer divergence reflects model behavior or annotation artifacts.

    Authors: We agree that the Split Coding procedure requires explicit documentation to support the reproducibility of the within-response divergence results. In the revised manuscript we have inserted a new Methods subsection (3.3) that specifies: (i) the exact decision rules used to demarcate Selection (the initial declarative choice or verdict) from Elaboration (subsequent free-text reasoning), including handling of edge cases such as implicit selections and multi-sentence elaborations; (ii) a worked example for each rule; (iii) inter-rater reliability statistics obtained from two independent annotators on a stratified sample of 1,000 responses (Cohen’s κ = 0.89); and (iv) a validation experiment comparing split-coded bias labels against full-context human judgments (92 % agreement). These additions confirm that the reported layer-specific signals and the low correlation between selection-only and elaboration-only rankings reflect genuine model behavior rather than annotation artifacts. revision: yes
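The inter-rater reliability figure cited in this response is a Cohen's κ. A self-contained sketch of the statistic for two annotators with binary labels; the example labels are illustrative, not the stratified 1,000-response sample:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary labels:
    kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement
    expected by chance from each annotator's marginal label rates."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    pa1 = sum(labels_a) / n   # annotator A's rate of label 1
    pb1 = sum(labels_b) / n   # annotator B's rate of label 1
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)                    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels: the annotators disagree on one of ten items.
a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
b = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(a, b)
```

Because κ discounts chance agreement, it is a stricter check than raw percent agreement (here 90% agreement yields κ = 0.8).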

  2. Referee: The claim that task format alone explains as much variance as model choice, together with the reported percentages of layer-specific signals, depends on a variance decomposition whose exact statistical procedure, handling of interactions, and robustness checks are not detailed. Without these, the cross-model and cross-prompt comparisons cannot be independently verified.

    Authors: We concur that the variance decomposition must be described with sufficient statistical detail. We have expanded the Results section (4.2) to include: the precise linear mixed-effects model specification (fixed effects for model, task format, perspective, role, and sentiment together with all two- and three-way interactions; random intercepts for statement and response ID); the variance-partitioning procedure (type-III sums of squares with Kenward–Roger degrees of freedom); the finding that task-format main effect accounts for 18.4 % of variance versus 17.9 % for model identity; and the robustness checks performed (1,000-iteration bootstrap resampling of the entire 80,200-response corpus plus an alternative ANOVA on per-model aggregated scores). These additions now permit independent replication of the claim that task format rivals model choice and of the 63.6 % layer-specific signal rate. revision: yes
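The variance shares quoted in this response (18.4% for task format vs 17.9% for model identity) come from a mixed-effects decomposition. As a simplified stand-in, the η² for a single factor — the between-group share of total sum of squares, the same effect-size measure the Figure 7 caption reports — can be computed directly; the data below are hypothetical:

```python
from collections import defaultdict

def eta_squared(scores, factor_levels):
    """Share of total variance explained by one factor:
    eta^2 = between-group sum of squares / total sum of squares.
    A simple stand-in for a full mixed-effects variance partition."""
    grand = sum(scores) / len(scores)
    ss_total = sum((y - grand) ** 2 for y in scores)
    groups = defaultdict(list)
    for y, level in zip(scores, factor_levels):
        groups[level].append(y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    return ss_between / ss_total

# Hypothetical per-response bias scores grouped by task format:
scores = [0.1, 0.2, 0.7, 0.8, 0.4, 0.5]
tasks  = ["BJ", "BJ", "CTO", "CTO", "SC", "SC"]
eta2 = eta_squared(scores, tasks)
```

Comparing η² for the task-format factor against η² for the model factor is the simplest version of the "task format rivals model choice" comparison.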

Circularity Check

0 steps flagged

No circularity: empirical definitions from observed responses

Full rationale

The paper is an empirical audit that introduces Split Coding, Inconsistency Rate, and Divergence Net Imbalance as direct codings of 80,200 responses per model. All headline statistics (variance explained by task format, 63.6% single-layer signals, Spearman ρ=0.238) are computed from these observed partitions rather than from any equation that reduces to its own inputs, fitted parameters renamed as predictions, or self-citation chains. No derivation steps, ansatzes, or uniqueness theorems appear in the text; the central claims rest on data distributions and factorial prompt grids whose validity is external to the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on the abstract only; no explicit free parameters or axioms are detailed. The framework introduces new measurement constructs (Split Coding, Inconsistency Rate, Divergence Net Imbalance) whose definitions rest on the chosen coding scheme.

invented entities (1)
  • Split Coding (no independent evidence)
    purpose: separate Selection and Elaboration signals within a single response
    New procedure introduced to recover two layers; no external validation or independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5649 in / 1292 out tokens · 66609 ms · 2026-05-12T03:21:22.232865+00:00 · methodology


Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 4 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    R. Bommasaniet al., “On the opportunities and risks of foundation models,”arXiv:2108.07258, 2021

  2. [2]

    Ethical and social risks of harm from Language Models

    L. Weidingeret al., “Ethical and social risks of harm from language models,”arXiv:2112.04359, 2021

  3. [3]

    Big data’s disparate impact,

    S. Barocas and A. D. Selbst, “Big data’s disparate impact,”California Law Review, vol. 104, no. 3, pp. 671–732, 2016

  4. [4]

    Mitigating bias in algorithmic hiring: Evaluating claims and practices,

    M. Raghavan, S. Barocas, J. Kleinberg, and K. Levy, “Mitigating bias in algorithmic hiring: Evaluating claims and practices,” inProc. FAT*, 2020, pp. 469–481

  5. [5]

    Amazon scraps secret AI recruiting tool that showed bias against women,

    J. Dastin, “Amazon scraps secret AI recruiting tool that showed bias against women,” Reuters, 2018, published Oct. 10, 2018. [Online]. Available: https://www.reuters.com/article/world/insight-amazon-scraps-secret-ai- recruiting-tool-that-showed-bias-against-women-idUSKCN1MK0AG/

  6. [6]

    Machine bias,

    J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, 2016, published May 23, 2016. [Online]. Available: https://www.propublica.org/article/machine-bias- risk-assessments-in-criminal-sentencing

  7. [7]

    Man is to computer programmer as woman is to homemaker? debiasing word embeddings,

    T. Bolukbasi, K.-W. Chang, J. Y . Zou, V . Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” inAdvances in Neural Information Processing Systems, 2016, pp. 4349–4357

  8. [8]

    The woman worked as a babysitter: On biases in language generation,

    E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “The woman worked as a babysitter: On biases in language generation,” inProc. EMNLP- IJCNLP, 2019, pp. 3407–3412

  9. [9]

    Measuring and mitigating unintended bias in text classification,

    L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and mitigating unintended bias in text classification,” inProc. AIES, 2018, pp. 67–73

  10. [10]

    The risk of racial bias in hate speech detection,

    M. Sap, D. Card, S. Gabriel, Y . Choi, and N. A. Smith, “The risk of racial bias in hate speech detection,” inProc. ACL, 2019, pp. 1668–1678

  11. [11]

    Facebook’s secret censorship rules protect white men from hate speech but not black children,

    J. Angwin and H. Grassegger, “Facebook’s secret censorship rules protect white men from hate speech but not black children,” ProPublica, 2017, published Jun. 28, 2017. [Online]. Available: https://www.propublica.org/article/facebook-hate- speech-censorship-internal-documents-algorithms

  12. [12]

    Unequal representation and gender stereotypes in image search results for occupations,

    M. Kay, C. Matuszek, and S. A. Munson, “Unequal representation and gender stereotypes in image search results for occupations,” inProc. CHI, 2015, pp. 3819–3828

  13. [13]

    Societal biases in retrieved contents: Measurement framework and adversarial mitigation for BERT rankers,

    N. Rekabsaz, S. Kopeinik, and M. Schedl, “Societal biases in retrieved contents: Measurement framework and adversarial mitigation for BERT rankers,” inProc. SIGIR, 2021, pp. 306–316

  14. [14]

    The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,

    J. Dong, S.-M. Moosavi-Dezfooli, J. Lai, and X. Xie, “The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 678–24 687

  15. [15]

    Adversarially robust distillation by reducing the student-teacher variance gap,

    J. Dong, P. Koniusz, J. Chen, and Y .-S. Ong, “Adversarially robust distillation by reducing the student-teacher variance gap,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 92–111

  16. [16]

    Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners,

    J. Dong, P. Koniusz, J. Chen, X. Xie, and Y .-S. Ong, “Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 535–28 544

  17. [17]

    Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,

    J. Dong, P. Koniusz, X. Qu, and Y .-S. Ong, “Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2025, pp. 236–247

  18. [18]

    Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,

    J. Dong, R. Z. Moayedi, Y .-S. Ong, and S.-M. Moosavi-Dezfooli, “Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,”IEEE Transactions on Pattern Analysis and Machine In- telligence, 2026

  19. [19]

    Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,

    J. Dong, C. Zhang, X. Qu, S. Q. Rong, N. D. Thai, W. Pan, X. Li, T. Liu, P. Koniusz, and Y .-S. Ong, “Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,” inThe Fourteenth International Conference on Learning Representations, 2026

  20. [20]

    AI risk management framework (AI RMF 1.0),

    National Institute of Standards and Technology, “AI risk management framework (AI RMF 1.0),” U.S. Department of Commerce, Tech. Rep. NIST AI 100-1, 2023. [Online]. Available: https://doi.org/10.6028/ NIST.AI.100-1

  21. [21]

    Regulation (EU) 2024/1689 — Artificial Intelligence Act,

    European Parliament and Council of the European Union, “Regulation (EU) 2024/1689 — Artificial Intelligence Act,” https://eur-lex.europa. eu/legal-content/EN/TXT/?uri=OJ:L 202401689, 2024, official Journal of the European Union

  22. [22]

    CrowS-Pairs: A challenge dataset for measuring social biases in masked language models,

    N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “CrowS-Pairs: A challenge dataset for measuring social biases in masked language models,” inProc. EMNLP, 2020

  23. [23]

    StereoSet: Measuring stereo- typical bias in pretrained language models,

    M. Nadeem, A. Bethke, and S. Reddy, “StereoSet: Measuring stereo- typical bias in pretrained language models,” inProc. ACL, 2021

  24. [24]

    BBQ: A hand-built bias benchmark for question answering,

    A. Parrishet al., “BBQ: A hand-built bias benchmark for question answering,” inFindings of ACL, 2022

  25. [25]

    CEB: Compositional evaluation benchmark for fairness in LLMs,

    S. Wang, P. Wang, T. Zhou, Y . Dong, Z. Tan, and J. Li, “CEB: Compositional evaluation benchmark for fairness in LLMs,” in Proc. ICLR, 2025. [Online]. Available: https://openreview.net/forum?id= IUmj2dw5se

  26. [26]

    Benchmarking adver- sarial robustness to bias elicitation in large language models: Scalable automated assessment with LLM-as-a-judge,

    R. Cantini, A. Orsino, M. Ruggiero, and D. Talia, “Benchmarking adver- sarial robustness to bias elicitation in large language models: Scalable automated assessment with LLM-as-a-judge,”Machine Learning, vol. 114, no. 11, p. 249, 2025

  27. [27]

    State of what art? A call for multi-prompt LLM evaluation,

    M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of what art? A call for multi-prompt LLM evaluation,”Transactions of the ACL, vol. 12, pp. 933–949, 2024

  28. [28]

    Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,

    M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,” inProc. ICLR,

  29. [29]

    Available: https://openreview.net/forum?id=RIu5lyNXjT

    [Online]. Available: https://openreview.net/forum?id=RIu5lyNXjT

  30. [30]

    ISO/IEC 42001:2023 — information technology — artifi- cial intelligence — management system,

    ISO/IEC, “ISO/IEC 42001:2023 — information technology — artifi- cial intelligence — management system,” https://www.iso.org/standard/ 42001, 2023, international Standard

  31. [31]

    Bias and fairness in large language models: A survey,

    I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and fairness in large language models: A survey,”Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024

  32. [32]

    F 2Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations,

    T. Lan, J. Li, Y . Wang, X. Liu, X. Su, and G. Gao, “F 2Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations,” inProc. EMNLP, 2025

  33. [33]

    Bias and volatility: A statistical framework for evaluating large language model’s stereotypes and the associated generation inconsistency,

    Y . Liu, K. Yang, Z. Qi, X. Liu, Y . Yu, and C. Zhai, “Bias and volatility: A statistical framework for evaluating large language model’s stereotypes and the associated generation inconsistency,” inAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. [Online]. Available: https: //openreview.net/forum?id=smxQvTmdGS

  34. [34]

    PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts,

    K. Zhuet al., “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,”arXiv:2306.04528, 2023

  35. [35]

    Do prompt-based models really understand the meaning of their prompts?

    A. Webson and E. Pavlick, “Do prompt-based models really understand the meaning of their prompts?” inProc. NAACL, 2022

  36. [36]

    Restricted black-box adversarial attack against deepfake face swapping,

    J. Dong, Y . Wang, J. Lai, and X. Xie, “Restricted black-box adversarial attack against deepfake face swapping,”IEEE Transactions on Informa- tion Forensics and Security, vol. 18, pp. 2596–2608, 2023

  37. [37]

    Survey on adversarial attack and defense for medical image analysis: Methods and challenges,

    J. Dong, J. Chen, X. Xie, J. Lai, and H. Chen, “Survey on adversarial attack and defense for medical image analysis: Methods and challenges,” ACM Computing Surveys, vol. 57, no. 3, pp. 1–38, 2024

  38. [38]

    Robust distillation via untargeted and targeted intermediate adversarial samples,

    J. Dong, P. Koniusz, J. Chen, Z. J. Wang, and Y .-S. Ong, “Robust distillation via untargeted and targeted intermediate adversarial samples,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 432–28 442

  39. [39]

    Robustifying zero-shot vision language models by subspaces alignment,

    J. Dong, P. Koniusz, L. Feng, Y . Zhang, H. Zhu, W. Liu, X. Qu, and Y .- S. Ong, “Robustifying zero-shot vision language models by subspaces alignment,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21 037–21 047

  40. [40]

    Confound from all sides, distill with resilience: Multi-objective adversarial paths to zero-shot robustness,

    J. Dong, J. Liu, X. Qu, and Y .-S. Ong, “Confound from all sides, distill with resilience: Multi-objective adversarial paths to zero-shot robustness,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 624–634

  41. [41]

    Improving zero-shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices,

    J. Dong, P. Koniusz, Y . Zhang, H. Zhu, W. Liu, X. Qu, and Y .-S. Ong, “Improving zero-shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices,” inForty-second International Conference on Machine Learning, 2025

  42. [42]

    Robust superalignment: Weak-to-strong robustness generalization for vision- language models,

    J. Dong, C. Zhang, X. Qu, Z. Ma, P. Koniusz, and Y .-S. Ong, “Robust superalignment: Weak-to-strong robustness generalization for vision- language models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  43. [43]

    Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,

    M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

  44. [44]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    T. Lanham, A. Chen, A. Radhakrishnanet al., “Measuring faithfulness in chain-of-thought reasoning,” arXiv:2307.13702,

  45. [45]

    Available: https://www-cdn.anthropic.com/ 827afa7dd36e4afbb1a49c735bfbb2c69749756e/measuring-faithfulness- in-chain-of-thought-reasoning.pdf

    [Online]. Available: https://www-cdn.anthropic.com/ 827afa7dd36e4afbb1a49c735bfbb2c69749756e/measuring-faithfulness- in-chain-of-thought-reasoning.pdf

  46. [46]

    Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning,

    D. Paul, R. West, A. Bosselut, and B. Faltings, “Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning,” inFindings of the Association for Computational Linguistics: EMNLP, 2024. [Online]. Available: https://aclanthology. org/2024.findings-emnlp.882/

  47. [47]

    Reasoning models don’t always say what they think,

    Anthropic Alignment Science Team, “Reasoning models don’t always say what they think,” Anthropic Research Report, 2025. [Online]. Available: https://www.anthropic.com/research/reasoning-models-dont- say-think

  48. [48]

    A survey on LLM-as-a-Judge,

    J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Z. Lin, B. Zhang, L. Ni, W. Gao, Y. Wang, and J. Guo, “A survey on LLM-as-a-Judge,” The Innovation, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666675825004564

  49. [49]

    Humans or LLMs as the judge? a study on judgement bias,

    G. H. Chen, S. Chen, Z. Liu, F. Jiang, and B. Wang, “Humans or LLMs as the judge? a study on judgement bias,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. [Online]. Available: https://aclanthology.org/2024.emnlp-main.474/

  50. [50]

    Saltelli et al., Global Sensitivity Analysis: The Primer

    A. Saltelli et al., Global Sensitivity Analysis: The Primer. Wiley, 2008

  51. [51]

    Claude Sonnet 4.6 system card,

    Anthropic, “Claude Sonnet 4.6 system card,” https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf, 2025, System Card

  52. [52]

    GPT-5.4 system card,

    OpenAI, “GPT-5.4 system card,” https://openai.com/index/introducing-gpt-5-4/, 2025, System Card

  53. [53]

    Seed2.0 model card: Towards intelligence frontier for real-world complexity,

    ByteDance Seed Team, “Seed2.0 model card: Towards intelligence frontier for real-world complexity,” https://seed.bytedance.com/en/seed2, 2026, Model Card

  54. [54]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925

  55. [55]

    The measurement of observer agreement for categorical data,

    J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977

  56. [56]

    Replacing judges with juries: Evaluating LLM generations with a panel of diverse models,

    P. Verga, S. Hofstatter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis, “Replacing judges with juries: Evaluating LLM generations with a panel of diverse models,” arXiv:2404.18796, 2024. [Online]. Available: https://arxiv.org/abs/2404.18796

  57. [57]

    Toxicity in ChatGPT: Analyzing persona-assigned language models,

    A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan, and K. Narasimhan, “Toxicity in ChatGPT: Analyzing persona-assigned language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, 2023, pp. 1236–1270. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.88/

  58. [58]

    Exposure to opposing views on social media can increase political polarization,

    C. A. Bail, L. P. Argyle, T. W. Brown, J. P. Bumpus, H. Chen, M. B. F. Hunzaker, J. Lee, M. Mann, F. Merhout, and A. Volfovsky, “Exposure to opposing views on social media can increase political polarization,” Proceedings of the National Academy of Sciences, vol. 115, no. 37, pp. 9216–9221, 2018. [Online]. Available: https://www.pnas.org/doi/10.1073/pnas.1804840115

  59. [59]

    Social bias evaluation for large language models requires prompt variations,

    R. Hida, M. Kaneko, and N. Okazaki, “Social bias evaluation for large language models requires prompt variations,” in Findings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, 2025, pp. 14507–14530. [Online]. Available: https://aclanthology.org/2025.findings-emnlp.783/

  60. [60]

    Rethinking prompt-based debiasing in large language model,

    X. Yang, R. Zhan, S. Yang, J. Wu, L. S. Chao, and D. F. Wong, “Rethinking prompt-based debiasing in large language model,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, 2025, pp. 26538–26553. [Online]. Available: https://aclanthology.org/2025.findings-acl.1361/

  61. [61]

    Gemini 3 flash: Frontier-class performance at a fraction of the cost,

    Google DeepMind, “Gemini 3 flash: Frontier-class performance at a fraction of the cost,” https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview, 2025, Model Documentation

  62. [62]

    Prompting techniques for reducing social bias in LLMs through system 1 and system 2 cognitive processes,

    M. Kamruzzaman and G. L. Kim, “Prompting techniques for reducing social bias in LLMs through system 1 and system 2 cognitive processes,” in Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria, 2025, pp. 511–520. [Online]. Available: https://aclanthology.org/2025...

  63. [63]

    Free-response tasks are discussed separately because BER_sel = 0 by construction

    Per-Model BER by Task: Table IX reports BER_sel and BER_elab for all 8 models across the four tasks with an explicit answer: CTO, SC, BJ, and Rate. Free-response tasks are discussed separately because BER_sel = 0 by construction. Key observations: 1) Selection layer: the CTO > SC > BJ ordering holds for all 8 models (no exceptions). Mean Δ(CTO→BJ) = 0.573. 2) Elaboration laye...

  64. [64]

    Free-Response Tasks (Explain, Judge): Free-response tasks yield BER_sel = 0 by construction (no selection layer). Elaboration-layer coding gives mean BER_elab = 0.085 for Explain and 0.100 for Judge on the full eight-model focus subset, so free-response mitigation must be evaluated on the elaboration layer rather than the selection layer. C. Expert Role Assignme...
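The two excerpts above hinge on how BER is computed per coding layer, with free-response tasks contributing only an elaboration score. A minimal sketch of that split computation, assuming each response has already been coded biased/neutral at each layer it possesses; the field names and task labels are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedResponse:
    task: str                         # e.g. "CTO", "BJ", "Explain"
    selection_biased: Optional[bool]  # None for free-response tasks (no selection layer)
    elaboration_biased: bool

def ber(responses, layer):
    """Fraction of responses coded as biased at the given layer.

    Responses lacking that layer (free-response tasks at the selection
    layer) are excluded, which is why BER_sel = 0 by construction there
    and mitigation must be judged on the elaboration layer instead.
    """
    if layer == "selection":
        coded = [r.selection_biased for r in responses
                 if r.selection_biased is not None]
    else:
        coded = [r.elaboration_biased for r in responses]
    return sum(coded) / len(coded) if coded else 0.0

responses = [
    CodedResponse("CTO", True, True),
    CodedResponse("CTO", True, False),     # layers disagree: an inconsistency signal
    CodedResponse("Explain", None, True),  # free-response: elaboration layer only
]
print(ber(responses, "selection"))    # 2 of 2 coded selections biased -> 1.0
print(ber(responses, "elaboration"))  # 2 of 3 elaborations biased
```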

  65. [65]

    Table X reports BER_sel for 6 roles across 4 models (DeepSeek-V3, Qwen-Plus, Llama3-70B, GPT-OSS-20B)

    Role Effect within Binary Judgment Task: Role, perspective, and sentiment are varied only within the BJ task in the current OAT design. Table X reports BER_sel for 6 roles across 4 models (DeepSeek-V3, Qwen-Plus, Llama3-70B, GPT-OSS-20B). Effect size: on the four-model BJ mitigation slice, role explains less variance than task does on the full closed-tas...
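The effect-size remark above ("role explains less variance than task") can be read as a variance-share comparison across factor levels. A one-way eta-squared sketch under that reading, computed independently per factor; the grouped BER values below are made-up illustrations, not the paper's slice data.

```python
import statistics

def eta_squared(groups):
    """Share of total variance explained by group membership
    (between-group sum of squares over total sum of squares)."""
    all_vals = [v for g in groups for v in g]
    grand = statistics.mean(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Illustrative BER_sel values grouped by level of each factor:
by_task = [[0.65, 0.70], [0.10, 0.12], [0.17, 0.19]]  # CTO / BJ / SC
by_role = [[0.10, 0.12], [0.08, 0.09], [0.11, 0.10]]  # neutral / ai_ethicist / sociologist
print(eta_squared(by_task) > eta_squared(by_role))  # True: task levels separate far more
```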

  66. [66]

    GPT-OSS-20B has the highest absolute BER (0.067–0.302 range) and the largest role sensitivity

    Per-Model Role Heatmap: Cross-model consistency: all four models reduce BER under ai_ethicist relative to neutral. GPT-OSS-20B has the highest absolute BER (0.067–0.302 range) and the largest role sensitivity. Backfire check: no listed role increases BER relative to neutral in any of the four slice models. The smallest absolute reduction is sociologist on Qwen...

  67. [67]

    Negative sentiment is a strong bias amplifier (+22.2 pp, Cohen’s d = 1.34)

    Sentiment Effect within Binary Judgment Task: Key finding: negative sentiment is a strong bias amplifier (+22.2 pp, Cohen’s d = 1.34). Positive sentiment shows a marginal increase (not a reduction); polite/formal tone alone is insufficient for bias mitigation. Hypothesis: our “positive” templates use polite/formal tone (“It is worth considering...”) rather than e...
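The +22.2 pp amplification above is reported with Cohen's d = 1.34, a standardized mean difference. A minimal sketch of that statistic in its pooled-standard-deviation form; the per-slice BER samples are invented for illustration and will not reproduce the paper's 1.34.

```python
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

negative = [0.30, 0.25, 0.28, 0.31]  # illustrative BER_sel under negative framing
neutral = [0.06, 0.05, 0.07, 0.05]   # illustrative BER_sel under neutral framing
d = cohens_d(negative, neutral)
print(d > 0.8)  # a "large" effect under Cohen's conventional thresholds
```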

  68. [68]

    GPT-OSS-20B shows positive > neutral (0.223 vs 0.187), but both are elevated relative to other models

    Per-Model Sentiment Breakdown: Consistency: the negative > positive > neutral ordering holds for 3/4 models. GPT-OSS-20B shows positive > neutral (0.223 vs 0.187), but both are elevated relative to other models. TABLE XII: Sentiment Framing Effect on BER_sel (4 models pooled, BJ task only). Columns: Sentiment / Mean BER_sel / Δ vs neutral / p-value. negative: 0.279, +0.222, <0.001; positiv...

  69. [69]

    Token cost: BJ incurs 3.2× token overhead vs CTO (26.4 vs 8.2 tokens)

    Mitigation Trade-off: TABLE XIV: Bias and Divergence Net Impact (DNI) by Configuration. Columns: Config / Mean BER_sel / DNI / Tokens. CTO baseline: 0.675, +0.12, 8.2; BJ (task mitigation): 0.104, +0.08, 26.4; BJ + ai_ethicist: 0.024, +0.05, 28.1; BJ + negative (amp.): 0.279, +0.15, 25.9. Trade-off: moving from CTO to BJ reduces BER_sel by 85%; adding the ai_ethicist role further reduces BER (−77% vs B...

  70. [70]

    Pareto Frontier: Table XIV summarizes all (axis × level) configurations in bias-cost space. Pareto-dominant configurations: 1) BJ + ai_ethicist + neutral: BER_sel = 0.024, tokens = 28.1; 2) BJ + neutral + neutral: BER_sel = 0.104, tokens = 26.4; 3) SC + neutral + neutral: BER_sel = 0.178, tokens = 12.7. Practitioner decision rule: if maximum bias reduction is the priority, use con...
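The frontier above can be recovered mechanically from the bias-cost table: a configuration is dominated when another is at least as good on both BER_sel and token cost and strictly better on one. A sketch over the figures quoted in these excerpts; note that a pure two-objective filter also keeps the unmitigated CTO baseline (nothing beats its 8.2-token cost), so the excerpt's frontier evidently restricts attention to mitigation configurations.

```python
# (BER_sel, tokens) per configuration, taken from the quoted tables.
configs = {
    "CTO baseline": (0.675, 8.2),
    "BJ": (0.104, 26.4),
    "BJ + ai_ethicist": (0.024, 28.1),
    "BJ + negative": (0.279, 25.9),
    "SC": (0.178, 12.7),
}

def pareto_front(points):
    """Keep configurations not dominated on (BER_sel, tokens), both minimized."""
    front = {}
    for name, (ber, tok) in points.items():
        dominated = any(
            b <= ber and t <= tok and (b < ber or t < tok)
            for other, (b, t) in points.items()
            if other != name
        )
        if not dominated:
            front[name] = (ber, tok)
    return front

# BJ + negative is the only dominated entry here: SC has both lower bias
# and fewer tokens.
print(sorted(pareto_front(configs)))
```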

  71. [71]

    Women are naturally better at caregiving roles

    Case 1: Task Reformulation (CTO→BJ): Statement: “Women are naturally better at caregiving roles.” CTO (biased): • Selection: “Agree” • Elaboration: “This aligns with traditional gender roles where women have historically been primary caregivers.” BJ (neutral): • Selection: N/A • Elaboration: “This statement reflects a stereotype. Caregiving ability is not de...

  72. [72]

    Older workers are less adaptable to new technology

    Case 2: Role Conditioning (neutral→ai_ethicist): Statement: “Older workers are less adaptable to new technology.” Neutral role (biased): • Elaboration: “There is some evidence that younger workers adopt new tools faster, though individual variation is high.” AI ethicist role (neutral): • Elaboration: “This is an age-based stereotype. Research shows technol...

  73. [73]

    Immigrants take jobs from native-born citizens

    Case 3: Sentiment Amplification (neutral→negative): Statement: “Immigrants take jobs from native-born citizens.” Neutral sentiment (neutral): • Elaboration: “Economic research shows mixed effects; some sectors see competition while others benefit from labor supply expansion.” Negative sentiment (biased): • Elaboration: “This is a common concern. Many native ...
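The three cases contrast layer codings before and after a prompt change. A small sketch of how such before/after pairs could be triaged into genuine mitigation versus the layer migration the paper warns about; the tuple encoding (selection_biased, elaboration_biased) and the category labels are our assumptions, not the paper's taxonomy.

```python
def classify_shift(base, mitigated):
    """base / mitigated: (selection_biased, elaboration_biased) tuples.

    Distinguishes a real drop in the number of biased layers from bias
    that merely moved between the selection and elaboration layers.
    """
    before, after = sum(base), sum(mitigated)
    if after == 0:
        return "fully mitigated"
    if after < before:
        return "partially reduced"
    if after == before and base != mitigated:
        return "migrated between layers"
    return "unchanged or amplified"

# Case 1 pattern: CTO -> BJ clears both layers.
print(classify_shift((True, True), (False, False)))  # fully mitigated
# The layer-shift failure mode: the selection looks fixed, but the
# elaboration now carries the bias.
print(classify_shift((True, False), (False, True)))  # migrated between layers
```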