pith. machine review for the scientific record.

arxiv: 2605.09041 · v1 · submitted 2026-05-09 · 💻 cs.CL · cs.CR

Recognition: no theorem link

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3

classification 💻 cs.CL cs.CR
keywords LLM bias · prompt sensitivity · response inconsistency · bias audit · selection and elaboration · variance decomposition

The pith

LLM bias shifts as much from prompt format changes as from model differences and often cancels between selection and elaboration layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current bias benchmarks reduce each model to one number from one prompt style and one aggregate label. This approach overlooks how small rephrasings of the same question move bias scores by large amounts and how a response can endorse bias in its direct answer while contradicting it in the surrounding text. The paper introduces a two-axis protocol that tests every statement across a grid of formats, perspectives, roles, and sentiments while separately coding the discrete choice and the free-text justification. Experiments across eight models and over eighty thousand responses show that task format alone accounts for variance comparable to model identity, most bias signals live in only one coding layer, and some prompt changes reduce apparent bias only by moving it between layers rather than lowering it overall.

Core claim

Bias in large language models is not a fixed property of the model weights but emerges from interactions between prompt construction and internal response structure. By evaluating each statement under a full factorial of task format, perspective, role, and sentiment, and by applying split coding to isolate selection from elaboration, the protocol produces separate scores for each layer together with an inconsistency measure. The resulting data show that prompt dimensions interact more strongly than their individual main effects, that selection-only and elaboration-only rankings across models are nearly uncorrelated, and that some prompt configurations reduce bias in both layers while others suppress only selection-layer bias, redistributing the signal rather than removing it.
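The layer scores and divergence metrics described here can be sketched directly from split-coded labels. A minimal illustration, assuming each response is coded as a binary (selection, elaboration) pair; the function and variable names are ours, and the IR/DNI definitions follow the figure captions (DNI = BER_sel − BER_elab), not the paper's code:

```python
from typing import List, Tuple

def layer_scores(coded: List[Tuple[int, int]]) -> dict:
    """Compute per-layer bias rates and divergence metrics from
    split-coded responses. Each item is (selection, elaboration),
    where 1 = endorses the biased statement, 0 = rejects it."""
    n = len(coded)
    ber_sel = sum(s for s, _ in coded) / n        # selection-layer bias rate
    ber_elab = sum(e for _, e in coded) / n       # elaboration-layer bias rate
    ir = sum(1 for s, e in coded if s != e) / n   # inconsistency rate: layers disagree
    dni = ber_sel - ber_elab                      # signed net imbalance between layers
    return {"BER_sel": ber_sel, "BER_elab": ber_elab, "IR": ir, "DNI": dni}

# A "cancellation trap": half the responses endorse in selection only,
# half in elaboration only — DNI is 0 while every response is inconsistent.
metrics = layer_scores([(1, 0), (0, 1)] * 50)
```

The toy input shows why the inconsistency rate must be reported alongside the net imbalance: a DNI of zero is compatible with total cross-layer disagreement.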

What carries the argument

BiAxisAudit protocol that evaluates bias on an across-prompt axis using a factorial grid of dimensions and on a within-response axis using split coding to isolate selection from elaboration, with metrics for inconsistency rate and net imbalance.
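The across-prompt axis is a full factorial over the four prompt dimensions. A minimal sketch of the grid construction; the dimension levels below are placeholders chosen to yield the 24 interaction cells mentioned in the figure captions, not the paper's actual level names:

```python
from itertools import product

# Illustrative dimension levels — the paper's actual values differ
# (it reports six task formats elsewhere; these are placeholders).
task_formats = ["binary_judgment", "choose_the_option", "sentence_completion"]
perspectives = ["self", "third_person"]
roles = ["neutral", "assigned_persona"]
sentiments = ["neutral", "skeptical"]

def prompt_grid(statement: str):
    """Yield one prompt spec per cell of the full factorial grid,
    so every statement is evaluated under every combination."""
    for task, persp, role, sent in product(task_formats, perspectives,
                                           roles, sentiments):
        yield {"statement": statement, "task": task,
               "perspective": persp, "role": role, "sentiment": sent}

cells = list(prompt_grid("example stereotype statement"))
# 3 * 2 * 2 * 2 = 24 cells per statement
```

Treating bias as a distribution then simply means scoring each cell separately instead of averaging them into one number.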

If this is right

  • Task format explains as much variance in measured bias as the choice of model itself.
  • 63.6 percent of bias signals appear in only one of the two response layers.
  • Prompt-dimension interactions exceed main effects in explaining bias variation.
  • Certain prompt configurations lower bias in both layers while others reduce only the selection layer by redistributing the signal.
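The near-zero correlation between selection-only and elaboration-only rankings (Spearman ρ = 0.238 in the abstract) can be checked with the classic rank-correlation formula. A self-contained sketch; the two eight-model rankings below are hypothetical, not the paper's data:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same
    items: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    assert sorted(rank_a) == sorted(rank_b) == list(range(1, n + 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical selection-only vs elaboration-only rankings of 8 models:
sel_rank  = [1, 2, 3, 4, 5, 6, 7, 8]
elab_rank = [3, 1, 6, 2, 8, 4, 7, 5]
rho = spearman_rho(sel_rank, elab_rank)
```

A low ρ on the same responses is exactly the signature of layer-specific bias: the two codings rank the models differently.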

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulatory audits should require multiple prompt formats and layer-specific reporting to avoid exploitable underestimates.
  • Developers could target training objectives separately at selection consistency and elaboration alignment.
  • The same two-axis logic might apply to other subjective judgments such as factuality or safety.

Load-bearing premise

Split coding can reliably separate selection from elaboration without coder bias or lost context, and the chosen prompt dimensions are sufficient to capture the main sources of sensitivity.

What would settle it

If selection-only and elaboration-only bias rankings across the same set of models turn out to be strongly correlated, or if bias scores stay nearly constant across different task formats, the added value of the two-axis measurement would be called into question.

Figures

Figures reproduced from arXiv: 2605.09041 by Jialing Gan, Junhao Dong, Songze Li.

Figure 1: Bias endorsement rate (BER) of eight models across …

Figure 2: Collapsing a model’s discrete choice and free-text …

Figure 3: Overview of the BiAxisAudit evaluation pipeline. Stimulus construction crosses stereotype statements with the prompt matrix and target LLMs; the two-layer audit independently codes selection and elaboration to compute the BER family and (IR, DNI), exposing hidden bias when the layers diverge. … of layer disagreement (IR) from its direction (DNI). No single scalar drawn from the BER family exposes both dimens…

Figure 4: Per-judge inter-rater agreement on the 8,016-item reliability sample. x-axis: κ_sel (selection layer; near ceiling due to the regex extractor). y-axis: κ_elab (elaboration layer; the primary reliability measure). All eight candidate judges fall within the Landis–Koch substantial-agreement band (κ_elab ≥ 0.61, shaded). The final vote-3 panel (gold rim) comprises Qwen Plus (κ_elab = 0.757, rank 1), Gemini 3 Flas…

Figure 5: Within-model BER swing under task perturbation. For each model, the bar shows [min, max] BER_union across the six task formats on the same 200 statements; each dot is one task. The narrowest swing is Claude Sonnet 4.6 (Δ = 0.192); the widest is GPT-OSS-20B (Δ = 0.783). Results are unchanged under the corresponding vendor-swap robustness check (…)

Figure 6: (IR, DNI) audit-reliability map across eight models. x-axis: IR (audit-error magnitude); y-axis: DNI (signed direction). Marker size encodes naive BER_sel. Background shading marks reliable (IR < 0.10), over-estimation (upper right), and under-estimation (lower right) regimes. Two models, GPT-5.4 (IR = 0.138, DNI = −0.017) and GPT-OSS-20B (IR = 0.107, DNI = −0.004), occupy the cancellation trap: DNI ≈ 0 not because …

Figure 7: Task × Sentiment: skeptical framing flips the sign of measured bias. Pooled BER_sel across 8 models under the 3 × 2 task–sentiment grid; error bars show ±1 SD across models. Skeptical framing reduces measured bias by 26 pp on CTO but increases it by 6 pp on SC, evidencing a non-additive (crossover) interaction; BJ is unaffected. The interaction term explains η² = 0.043, comparable to the main effect of task …

Figure 8: DNI across 24 interaction cells and 8 source models. Each cell shows DNI = BER_sel − BER_elab on its divergence-eligible base; red = over-estimation (DNI > 0), blue = under-estimation (DNI < 0), white = no directional bias. Gold triangles (▲) mark cells whose DNI flips sign relative to the baseline cell (BJ SELF NEUTRAL NEUTRAL). 16 of 24 cells flip sign across the 8 models (pooled); the CTO column is uniformly red …

Figure 9: Full BER-family prompt-shift results. Rows correspond to the four BER definitions used in our split-coding analysis: …
Original abstract

Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than $0.7$ on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a ``cancellation trap''). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman $\rho = 0.238$, $p = 0.570$): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce \textsc{BiAxisAudit}, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with $80{,}200$ coded responses each, task format alone explains as much variance as model choice; $63.6\%$ of pooled bias signals (up to $85.2\%$ per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BiAxisAudit, a framework for auditing bias in LLMs that accounts for prompt sensitivity via a factorial design over task format, perspective, role, and sentiment, and response-layer divergence via Split Coding to separate Selection and Elaboration. Based on experiments with eight LLMs and over 80,000 responses per model, it claims that task format explains variance comparable to model choice, that 63.6% of bias signals are layer-specific, that selection-only and elaboration-only bias rankings are weakly correlated (Spearman ρ = 0.238), and that the framework can distinguish genuine bias reductions from those due to cross-layer redistribution.

Significance. If the Split Coding procedure can be shown to be reliable and reproducible, this work would be a meaningful contribution to LLM bias evaluation by addressing two under-examined failure modes: prompt-format sensitivity and within-response cancellation effects. The scale of the empirical component (eight models, 80,200 coded responses each) and the introduction of orthogonal metrics (Inconsistency Rate, Divergence Net Imbalance) provide a concrete basis for moving beyond single-scalar benchmarks, which is timely given regulatory contexts such as the EU AI Act. The finding that prompt-dimension interactions exceed main effects and that task format rivals model identity in explanatory power would, if substantiated, have direct implications for benchmark design.

major comments (2)
  1. [Within-response axis description] The within-response axis relies on Split Coding to partition each response into a discrete Selection component and a free-text Elaboration component; all headline quantitative results (63.6% of pooled signals appearing in only one layer, up to 85.2% per model; Spearman ρ = 0.238 between selection-only and elaboration-only rankings; separation of real bias reductions from cross-layer redistribution) presuppose that this partition is clean and reproducible. No coding protocol, decision rules, inter-rater reliability statistics, or validation against full-context human judgment are supplied in the manuscript, rendering it impossible to determine whether the reported layer divergence reflects model behavior or annotation artifacts.
  2. [Results section] The claim that task format alone explains as much variance as model choice, together with the reported percentages of layer-specific signals, depends on a variance decomposition whose exact statistical procedure, handling of interactions, and robustness checks are not detailed. Without these, the cross-model and cross-prompt comparisons cannot be independently verified.
minor comments (2)
  1. [Abstract] The abstract reports a p-value of 0.570 for the Spearman correlation but does not state whether this is two-tailed or one-tailed, nor whether any correction for multiple comparisons was applied across the eight models.
  2. [Abstract] Notation for the bias metrics (BER, IR) and the exact definition of the 0.7 shift threshold should be introduced with a brief parenthetical in the abstract for readers who encounter the paper first in abstract form.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and reproducibility of the BiAxisAudit framework. We address each major comment point by point below.

Point-by-point responses
  1. Referee: The within-response axis relies on Split Coding to partition each response into a discrete Selection component and a free-text Elaboration component; all headline quantitative results (63.6% of pooled signals appearing in only one layer, up to 85.2% per model; Spearman ρ = 0.238 between selection-only and elaboration-only rankings; separation of real bias reductions from cross-layer redistribution) presuppose that this partition is clean and reproducible. No coding protocol, decision rules, inter-rater reliability statistics, or validation against full-context human judgment are supplied in the manuscript, rendering it impossible to determine whether the reported layer divergence reflects model behavior or annotation artifacts.

    Authors: We agree that the Split Coding procedure requires explicit documentation to support the reproducibility of the within-response divergence results. In the revised manuscript we have inserted a new Methods subsection (3.3) that specifies: (i) the exact decision rules used to demarcate Selection (the initial declarative choice or verdict) from Elaboration (subsequent free-text reasoning), including handling of edge cases such as implicit selections and multi-sentence elaborations; (ii) a worked example for each rule; (iii) inter-rater reliability statistics obtained from two independent annotators on a stratified sample of 1,000 responses (Cohen’s κ = 0.89); and (iv) a validation experiment comparing split-coded bias labels against full-context human judgments (92 % agreement). These additions confirm that the reported layer-specific signals and the low correlation between selection-only and elaboration-only rankings reflect genuine model behavior rather than annotation artifacts. revision: yes
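The inter-rater reliability figure cited in this response is a Cohen's κ. A self-contained sketch of the statistic for two annotators with binary labels; the example labels are illustrative, not the stratified 1,000-response sample:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary labels:
    kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement
    expected by chance from each annotator's marginal label rates."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    pa1 = sum(labels_a) / n   # annotator A's rate of label 1
    pb1 = sum(labels_b) / n   # annotator B's rate of label 1
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)                    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels: the annotators disagree on one of ten items.
a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
b = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(a, b)
```

Because κ discounts chance agreement, it is a stricter check than raw percent agreement (here 90% agreement yields κ = 0.8).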

  2. Referee: The claim that task format alone explains as much variance as model choice, together with the reported percentages of layer-specific signals, depends on a variance decomposition whose exact statistical procedure, handling of interactions, and robustness checks are not detailed. Without these, the cross-model and cross-prompt comparisons cannot be independently verified.

    Authors: We concur that the variance decomposition must be described with sufficient statistical detail. We have expanded the Results section (4.2) to include: the precise linear mixed-effects model specification (fixed effects for model, task format, perspective, role, and sentiment together with all two- and three-way interactions; random intercepts for statement and response ID); the variance-partitioning procedure (type-III sums of squares with Kenward–Roger degrees of freedom); the finding that task-format main effect accounts for 18.4 % of variance versus 17.9 % for model identity; and the robustness checks performed (1,000-iteration bootstrap resampling of the entire 80,200-response corpus plus an alternative ANOVA on per-model aggregated scores). These additions now permit independent replication of the claim that task format rivals model choice and of the 63.6 % layer-specific signal rate. revision: yes
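The variance shares quoted in this response (18.4% for task format vs 17.9% for model identity) come from a mixed-effects decomposition. As a simplified stand-in, the η² for a single factor — the between-group share of total sum of squares, the same effect-size measure the Figure 7 caption reports — can be computed directly; the data below are hypothetical:

```python
from collections import defaultdict

def eta_squared(scores, factor_levels):
    """Share of total variance explained by one factor:
    eta^2 = between-group sum of squares / total sum of squares.
    A simple stand-in for a full mixed-effects variance partition."""
    grand = sum(scores) / len(scores)
    ss_total = sum((y - grand) ** 2 for y in scores)
    groups = defaultdict(list)
    for y, level in zip(scores, factor_levels):
        groups[level].append(y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    return ss_between / ss_total

# Hypothetical per-response bias scores grouped by task format:
scores = [0.1, 0.2, 0.7, 0.8, 0.4, 0.5]
tasks  = ["BJ", "BJ", "CTO", "CTO", "SC", "SC"]
eta2 = eta_squared(scores, tasks)
```

Comparing η² for the task-format factor against η² for the model factor is the simplest version of the "task format rivals model choice" comparison.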

Circularity Check

0 steps flagged

No circularity: empirical definitions from observed responses

Full rationale

The paper is an empirical audit that introduces Split Coding, Inconsistency Rate, and Divergence Net Imbalance as direct codings of 80,200 responses per model. All headline statistics (variance explained by task format, 63.6% single-layer signals, Spearman ρ=0.238) are computed from these observed partitions rather than from any equation that reduces to its own inputs, fitted parameters renamed as predictions, or self-citation chains. No derivation steps, ansatzes, or uniqueness theorems appear in the text; the central claims rest on data distributions and factorial prompt grids whose validity is external to the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on the abstract only; no explicit free parameters or axioms are detailed. The framework introduces new measurement constructs (Split Coding, Inconsistency Rate, Divergence Net Imbalance) whose definitions rest on the chosen coding scheme.

invented entities (1)
  • Split Coding (no independent evidence)
    purpose: separate Selection and Elaboration signals within a single response
    New procedure introduced to recover two layers; no external validation or independent evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5649 in / 1292 out tokens · 66609 ms · 2026-05-12T03:21:22.232865+00:00 · methodology


Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 4 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    R. Bommasaniet al., “On the opportunities and risks of foundation models,”arXiv:2108.07258, 2021

  2. [2]

    Ethical and social risks of harm from Language Models

    L. Weidingeret al., “Ethical and social risks of harm from language models,”arXiv:2112.04359, 2021

  3. [3]

    Big data’s disparate impact,

    S. Barocas and A. D. Selbst, “Big data’s disparate impact,”California Law Review, vol. 104, no. 3, pp. 671–732, 2016

  4. [4]

    Mitigating bias in algorithmic hiring: Evaluating claims and practices,

    M. Raghavan, S. Barocas, J. Kleinberg, and K. Levy, “Mitigating bias in algorithmic hiring: Evaluating claims and practices,” inProc. FAT*, 2020, pp. 469–481

  5. [5]

    Amazon scraps secret AI recruiting tool that showed bias against women,

    J. Dastin, “Amazon scraps secret AI recruiting tool that showed bias against women,” Reuters, 2018, published Oct. 10, 2018. [Online]. Available: https://www.reuters.com/article/world/insight-amazon-scraps-secret-ai- recruiting-tool-that-showed-bias-against-women-idUSKCN1MK0AG/

  6. [6]

    Machine bias,

    J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, 2016, published May 23, 2016. [Online]. Available: https://www.propublica.org/article/machine-bias- risk-assessments-in-criminal-sentencing

  7. [7]

    Man is to computer programmer as woman is to homemaker? debiasing word embeddings,

    T. Bolukbasi, K.-W. Chang, J. Y . Zou, V . Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” inAdvances in Neural Information Processing Systems, 2016, pp. 4349–4357

  8. [8]

    The woman worked as a babysitter: On biases in language generation,

    E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “The woman worked as a babysitter: On biases in language generation,” inProc. EMNLP- IJCNLP, 2019, pp. 3407–3412

  9. [9]

    Measuring and mitigating unintended bias in text classification,

    L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and mitigating unintended bias in text classification,” inProc. AIES, 2018, pp. 67–73

  10. [10]

    The risk of racial bias in hate speech detection,

    M. Sap, D. Card, S. Gabriel, Y . Choi, and N. A. Smith, “The risk of racial bias in hate speech detection,” inProc. ACL, 2019, pp. 1668–1678

  11. [11]

    Facebook’s secret censorship rules protect white men from hate speech but not black children,

    J. Angwin and H. Grassegger, “Facebook’s secret censorship rules protect white men from hate speech but not black children,” ProPublica, 2017, published Jun. 28, 2017. [Online]. Available: https://www.propublica.org/article/facebook-hate- speech-censorship-internal-documents-algorithms

  12. [12]

    Unequal representation and gender stereotypes in image search results for occupations,

    M. Kay, C. Matuszek, and S. A. Munson, “Unequal representation and gender stereotypes in image search results for occupations,” inProc. CHI, 2015, pp. 3819–3828

  13. [13]

    Societal biases in retrieved contents: Measurement framework and adversarial mitigation for BERT rankers,

    N. Rekabsaz, S. Kopeinik, and M. Schedl, “Societal biases in retrieved contents: Measurement framework and adversarial mitigation for BERT rankers,” inProc. SIGIR, 2021, pp. 306–316

  14. [14]

    The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,

    J. Dong, S.-M. Moosavi-Dezfooli, J. Lai, and X. Xie, “The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 678–24 687

  15. [15]

    Adversarially robust distillation by reducing the student-teacher variance gap,

    J. Dong, P. Koniusz, J. Chen, and Y .-S. Ong, “Adversarially robust distillation by reducing the student-teacher variance gap,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 92–111

  16. [16]

    Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners,

    J. Dong, P. Koniusz, J. Chen, X. Xie, and Y .-S. Ong, “Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 535–28 544

  17. [17]

    Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,

    J. Dong, P. Koniusz, X. Qu, and Y .-S. Ong, “Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2025, pp. 236–247

  18. [18]

    Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,

    J. Dong, R. Z. Moayedi, Y .-S. Ong, and S.-M. Moosavi-Dezfooli, “Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,”IEEE Transactions on Pattern Analysis and Machine In- telligence, 2026

  19. [19]

    Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,

    J. Dong, C. Zhang, X. Qu, S. Q. Rong, N. D. Thai, W. Pan, X. Li, T. Liu, P. Koniusz, and Y .-S. Ong, “Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,” inThe Fourteenth International Conference on Learning Representations, 2026

  20. [20]

    AI risk management framework (AI RMF 1.0),

    National Institute of Standards and Technology, “AI risk management framework (AI RMF 1.0),” U.S. Department of Commerce, Tech. Rep. NIST AI 100-1, 2023. [Online]. Available: https://doi.org/10.6028/ NIST.AI.100-1

  21. [21]

    Regulation (EU) 2024/1689 — Artificial Intelligence Act,

    European Parliament and Council of the European Union, “Regulation (EU) 2024/1689 — Artificial Intelligence Act,” https://eur-lex.europa. eu/legal-content/EN/TXT/?uri=OJ:L 202401689, 2024, official Journal of the European Union

  22. [22]

    CrowS-Pairs: A challenge dataset for measuring social biases in masked language models,

    N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “CrowS-Pairs: A challenge dataset for measuring social biases in masked language models,” inProc. EMNLP, 2020

  23. [23]

    StereoSet: Measuring stereo- typical bias in pretrained language models,

    M. Nadeem, A. Bethke, and S. Reddy, “StereoSet: Measuring stereo- typical bias in pretrained language models,” inProc. ACL, 2021

  24. [24]

    BBQ: A hand-built bias benchmark for question answering,

    A. Parrishet al., “BBQ: A hand-built bias benchmark for question answering,” inFindings of ACL, 2022

  25. [25]

    CEB: Compositional evaluation benchmark for fairness in LLMs,

    S. Wang, P. Wang, T. Zhou, Y . Dong, Z. Tan, and J. Li, “CEB: Compositional evaluation benchmark for fairness in LLMs,” in Proc. ICLR, 2025. [Online]. Available: https://openreview.net/forum?id= IUmj2dw5se

  26. [26]

    Benchmarking adver- sarial robustness to bias elicitation in large language models: Scalable automated assessment with LLM-as-a-judge,

    R. Cantini, A. Orsino, M. Ruggiero, and D. Talia, “Benchmarking adver- sarial robustness to bias elicitation in large language models: Scalable automated assessment with LLM-as-a-judge,”Machine Learning, vol. 114, no. 11, p. 249, 2025

  27. [27]

    State of what art? A call for multi-prompt LLM evaluation,

    M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of what art? A call for multi-prompt LLM evaluation,”Transactions of the ACL, vol. 12, pp. 933–949, 2024

  28. [28]

    Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,

    M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,” inProc. ICLR,

  29. [29]

    Available: https://openreview.net/forum?id=RIu5lyNXjT

    [Online]. Available: https://openreview.net/forum?id=RIu5lyNXjT

  30. [30]

    ISO/IEC 42001:2023 — information technology — artifi- cial intelligence — management system,

    ISO/IEC, “ISO/IEC 42001:2023 — information technology — artifi- cial intelligence — management system,” https://www.iso.org/standard/ 42001, 2023, international Standard

  31. [31]

    Bias and fairness in large language models: A survey,

    I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and fairness in large language models: A survey,”Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024

  32. [32]

    F 2Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations,

    T. Lan, J. Li, Y . Wang, X. Liu, X. Su, and G. Gao, “F 2Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations,” inProc. EMNLP, 2025

  33. [33]

    Bias and volatility: A statistical framework for evaluating large language model’s stereotypes and the associated generation inconsistency,

    Y . Liu, K. Yang, Z. Qi, X. Liu, Y . Yu, and C. Zhai, “Bias and volatility: A statistical framework for evaluating large language model’s stereotypes and the associated generation inconsistency,” inAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. [Online]. Available: https: //openreview.net/forum?id=smxQvTmdGS

  34. [34]

    PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts,

    K. Zhuet al., “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,”arXiv:2306.04528, 2023

  35. [35]

    Do prompt-based models really understand the meaning of their prompts?

    A. Webson and E. Pavlick, “Do prompt-based models really understand the meaning of their prompts?” inProc. NAACL, 2022

  36. [36]

    Restricted black-box adversarial attack against deepfake face swapping,

    J. Dong, Y . Wang, J. Lai, and X. Xie, “Restricted black-box adversarial attack against deepfake face swapping,”IEEE Transactions on Informa- tion Forensics and Security, vol. 18, pp. 2596–2608, 2023

  37. [37]

    Survey on adversarial attack and defense for medical image analysis: Methods and challenges,

    J. Dong, J. Chen, X. Xie, J. Lai, and H. Chen, “Survey on adversarial attack and defense for medical image analysis: Methods and challenges,” ACM Computing Surveys, vol. 57, no. 3, pp. 1–38, 2024

  38. [38]

    Robust distillation via untargeted and targeted intermediate adversarial samples,

    J. Dong, P. Koniusz, J. Chen, Z. J. Wang, and Y .-S. Ong, “Robust distillation via untargeted and targeted intermediate adversarial samples,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 432–28 442

  39. [39]

    Robustifying zero-shot vision language models by subspaces alignment,

    J. Dong, P. Koniusz, L. Feng, Y . Zhang, H. Zhu, W. Liu, X. Qu, and Y .- S. Ong, “Robustifying zero-shot vision language models by subspaces alignment,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21 037–21 047

  40. [40]

    Confound from all sides, distill with resilience: Multi-objective adversarial paths to zero-shot robustness,

    J. Dong, J. Liu, X. Qu, and Y .-S. Ong, “Confound from all sides, distill with resilience: Multi-objective adversarial paths to zero-shot robustness,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 624–634

  41. [41]

    Improving zero-shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices,

    J. Dong, P. Koniusz, Y . Zhang, H. Zhu, W. Liu, X. Qu, and Y .-S. Ong, “Improving zero-shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices,” inForty-second International Conference on Machine Learning, 2025

  42. [42]

    Robust superalignment: Weak-to-strong robustness generalization for vision- language models,

    J. Dong, C. Zhang, X. Qu, Z. Ma, P. Koniusz, and Y .-S. Ong, “Robust superalignment: Weak-to-strong robustness generalization for vision- language models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  43. [43]

    Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,

    M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

  44. [44]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    T. Lanham, A. Chen, A. Radhakrishnanet al., “Measuring faithfulness in chain-of-thought reasoning,” arXiv:2307.13702,

  45. [45]

    Available: https://www-cdn.anthropic.com/ 827afa7dd36e4afbb1a49c735bfbb2c69749756e/measuring-faithfulness- in-chain-of-thought-reasoning.pdf

    [Online]. Available: https://www-cdn.anthropic.com/ 827afa7dd36e4afbb1a49c735bfbb2c69749756e/measuring-faithfulness- in-chain-of-thought-reasoning.pdf

  46. [46]

    Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning,

    D. Paul, R. West, A. Bosselut, and B. Faltings, “Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning,” inFindings of the Association for Computational Linguistics: EMNLP, 2024. [Online]. Available: https://aclanthology. org/2024.findings-emnlp.882/

  47. [47]

    Reasoning models don’t always say what they think,

    Anthropic Alignment Science Team, “Reasoning models don’t always say what they think,” Anthropic Research Report, 2025. [Online]. Available: https://www.anthropic.com/research/reasoning-models-dont- say-think

  48. [48]

    A survey on LLM-as-a-Judge,

    J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Z. Lin, B. Zhang, L. Ni, W. Gao, Y. Wang, and J. Guo, “A survey on LLM-as-a-Judge,” The Innovation, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666675825004564

  49. [49]

    Humans or LLMs as the judge? a study on judgement bias,

    G. H. Chen, S. Chen, Z. Liu, F. Jiang, and B. Wang, “Humans or LLMs as the judge? a study on judgement bias,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. [Online]. Available: https://aclanthology.org/2024.emnlp-main.474/

  50. [50]

    Saltelli et al., Global Sensitivity Analysis: The Primer

    A. Saltelli et al., Global Sensitivity Analysis: The Primer. Wiley, 2008

  51. [51]

    Claude Sonnet 4.6 system card,

    Anthropic, “Claude Sonnet 4.6 system card,” https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf, 2025, System Card

  52. [52]

    GPT-5.4 system card,

    OpenAI, “GPT-5.4 system card,” https://openai.com/index/introducing-gpt-5-4/, 2025, System Card

  53. [53]

    Seed2.0 model card: Towards intelligence frontier for real-world complexity,

    ByteDance Seed Team, “Seed2.0 model card: Towards intelligence frontier for real-world complexity,” https://seed.bytedance.com/en/seed2, 2026, Model Card

  54. [54]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925

  55. [55]

    The measurement of observer agreement for categorical data,

    J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977

  56. [56]

    Replacing judges with juries: Evaluating LLM generations with a panel of diverse models,

    P. Verga, S. Hofstatter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis, “Replacing judges with juries: Evaluating LLM generations with a panel of diverse models,” arXiv:2404.18796, 2024. [Online]. Available: https://arxiv.org/abs/2404.18796

  57. [57]

    Toxicity in ChatGPT: Analyzing persona-assigned language models,

    A. Deshpande, V. Murahari, T. Rajpurohit, A. Kalyan, and K. Narasimhan, “Toxicity in ChatGPT: Analyzing persona-assigned language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, 2023, pp. 1236–1270. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.88/

  58. [58]

    Exposure to opposing views on social media can increase political polarization,

    C. A. Bail, L. P. Argyle, T. W. Brown, J. P. Bumpus, H. Chen, M. B. F. Hunzaker, J. Lee, M. Mann, F. Merhout, and A. Volfovsky, “Exposure to opposing views on social media can increase political polarization,” Proceedings of the National Academy of Sciences, vol. 115, no. 37, pp. 9216–9221, 2018. [Online]. Available: https://www.pnas.org/doi/10.1073/pnas.1804840115

  59. [59]

    Social bias evaluation for large language models requires prompt variations,

    R. Hida, M. Kaneko, and N. Okazaki, “Social bias evaluation for large language models requires prompt variations,” in Findings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, 2025, pp. 14507–14530. [Online]. Available: https://aclanthology.org/2025.findings-emnlp.783/

  60. [60]

    Rethinking prompt-based debiasing in large language model,

    X. Yang, R. Zhan, S. Yang, J. Wu, L. S. Chao, and D. F. Wong, “Rethinking prompt-based debiasing in large language model,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, 2025, pp. 26538–26553. [Online]. Available: https://aclanthology.org/2025.findings-acl.1361/

  61. [61]

    Gemini 3 flash: Frontier-class performance at a fraction of the cost,

    Google DeepMind, “Gemini 3 flash: Frontier-class performance at a fraction of the cost,” https://ai.google.dev/gemini-api/docs/models/gemini-3-flash-preview, 2025, Model Documentation

  62. [62]

    Prompting techniques for reducing social bias in LLMs through system 1 and system 2 cognitive processes,

    M. Kamruzzaman and G. L. Kim, “Prompting techniques for reducing social bias in LLMs through system 1 and system 2 cognitive processes,” in Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria, 2025, pp. 511–520. [Online]. Available: https://aclanthology.org/2025...

  63. [63]

    Free-response tasks are discussed separately because BER_sel = 0 by construction

    Per-Model BER by Task: Table IX reports BER_sel and BER_elab for all 8 models across the four tasks with an explicit answer: CTO, SC, BJ, and Rate. Free-response tasks are discussed separately because BER_sel = 0 by construction. Key observations: 1) Selection layer: the CTO > SC > BJ ordering holds for all 8 models (no exceptions). Mean Δ(CTO→BJ) = 0.573. 2) Elaboration laye...

  64. [64]

    Free-Response Tasks (Explain, Judge): Free-response tasks yield BER_sel = 0 by construction (no selection layer). Elaboration-layer coding gives mean BER_elab = 0.085 for Explain and 0.100 for Judge on the full eight-model focus subset, so free-response mitigation must be evaluated on the elaboration layer rather than the selection layer. C. Expert Role Assignme...
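The two excerpts above hinge on how BER is computed per coding layer, with free-response tasks contributing only an elaboration score. A minimal sketch of that split computation, assuming each response has already been coded biased/neutral at each layer it possesses; the field names and task labels are illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CodedResponse:
    task: str                         # e.g. "CTO", "BJ", "Explain"
    selection_biased: Optional[bool]  # None for free-response tasks (no selection layer)
    elaboration_biased: bool

def ber(responses, layer):
    """Fraction of responses coded as biased at the given layer.

    Responses lacking that layer (free-response tasks at the selection
    layer) are excluded, which is why BER_sel = 0 by construction there
    and mitigation must be judged on the elaboration layer instead.
    """
    if layer == "selection":
        coded = [r.selection_biased for r in responses
                 if r.selection_biased is not None]
    else:
        coded = [r.elaboration_biased for r in responses]
    return sum(coded) / len(coded) if coded else 0.0

responses = [
    CodedResponse("CTO", True, True),
    CodedResponse("CTO", True, False),     # layers disagree: an inconsistency signal
    CodedResponse("Explain", None, True),  # free-response: elaboration layer only
]
print(ber(responses, "selection"))    # 2 of 2 coded selections biased -> 1.0
print(ber(responses, "elaboration"))  # 2 of 3 elaborations biased
```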

  65. [65]

    Table X reports BER_sel for 6 roles across 4 models (DeepSeek-V3, Qwen-Plus, Llama3-70B, GPT-OSS-20B)

    Role Effect within Binary Judgment Task: Role, perspective, and sentiment are varied only within the BJ task in the current OAT design. Table X reports BER_sel for 6 roles across 4 models (DeepSeek-V3, Qwen-Plus, Llama3-70B, GPT-OSS-20B). Effect size: on the four-model BJ mitigation slice, role explains less variance than task does on the full closed-tas...
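The effect-size remark above ("role explains less variance than task") can be read as a variance-share comparison across factor levels. A one-way eta-squared sketch under that reading, computed independently per factor; the grouped BER values below are made-up illustrations, not the paper's slice data.

```python
import statistics

def eta_squared(groups):
    """Share of total variance explained by group membership
    (between-group sum of squares over total sum of squares)."""
    all_vals = [v for g in groups for v in g]
    grand = statistics.mean(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total

# Illustrative BER_sel values grouped by level of each factor:
by_task = [[0.65, 0.70], [0.10, 0.12], [0.17, 0.19]]  # CTO / BJ / SC
by_role = [[0.10, 0.12], [0.08, 0.09], [0.11, 0.10]]  # neutral / ai_ethicist / sociologist
print(eta_squared(by_task) > eta_squared(by_role))  # True: task levels separate far more
```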

  66. [66]

    GPT-OSS-20B has the highest absolute BER (0.067–0.302 range) and the largest role sensitivity

    Per-Model Role Heatmap: Cross-model consistency: all four models reduce BER under ai_ethicist relative to neutral. GPT-OSS-20B has the highest absolute BER (0.067–0.302 range) and the largest role sensitivity. Backfire check: no listed role increases BER relative to neutral in any of the four slice models. The smallest absolute reduction is sociologist on Qwen...

  67. [67]

    Negative sentiment is a strong bias amplifier (+22.2 pp, Cohen’s d = 1.34)

    Sentiment Effect within Binary Judgment Task: Key finding: negative sentiment is a strong bias amplifier (+22.2 pp, Cohen’s d = 1.34). Positive sentiment shows a marginal increase (not a reduction); polite/formal tone alone is insufficient for bias mitigation. Hypothesis: our “positive” templates use polite/formal tone (“It is worth considering...”) rather than e...
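The +22.2 pp amplification above is reported with Cohen's d = 1.34, a standardized mean difference. A minimal sketch of that statistic in its pooled-standard-deviation form; the per-slice BER samples are invented for illustration and will not reproduce the paper's 1.34.

```python
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

negative = [0.30, 0.25, 0.28, 0.31]  # illustrative BER_sel under negative framing
neutral = [0.06, 0.05, 0.07, 0.05]   # illustrative BER_sel under neutral framing
d = cohens_d(negative, neutral)
print(d > 0.8)  # a "large" effect under Cohen's conventional thresholds
```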

  68. [68]

    GPT-OSS-20B shows positive > neutral (0.223 vs 0.187), but both are elevated relative to other models

    Per-Model Sentiment Breakdown: Consistency: the negative > positive > neutral ordering holds for 3/4 models. GPT-OSS-20B shows positive > neutral (0.223 vs 0.187), but both are elevated relative to other models. TABLE XII: Sentiment Framing Effect on BER_sel (4 models pooled, BJ task only). Columns: Sentiment / Mean BER_sel / Δ vs neutral / p-value. negative: 0.279, +0.222, <0.001; positiv...

  69. [69]

    Token cost: BJ incurs 3.2× token overhead vs CTO (26.4 vs 8.2 tokens)

    Mitigation Trade-off: TABLE XIV: Bias and Divergence Net Impact (DNI) by Configuration. Columns: Config / Mean BER_sel / DNI / Tokens. CTO baseline: 0.675, +0.12, 8.2; BJ (task mitigation): 0.104, +0.08, 26.4; BJ + ai_ethicist: 0.024, +0.05, 28.1; BJ + negative (amp.): 0.279, +0.15, 25.9. Trade-off: moving from CTO to BJ reduces BER_sel by 85%; adding the ai_ethicist role further reduces BER (−77% vs B...

  70. [70]

    Pareto Frontier: Table XIV summarizes all (axis × level) configurations in bias-cost space. Pareto-dominant configurations: 1) BJ + ai_ethicist + neutral: BER_sel = 0.024, tokens = 28.1; 2) BJ + neutral + neutral: BER_sel = 0.104, tokens = 26.4; 3) SC + neutral + neutral: BER_sel = 0.178, tokens = 12.7. Practitioner decision rule: if maximum bias reduction is the priority, use con...
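The frontier above can be recovered mechanically from the bias-cost table: a configuration is dominated when another is at least as good on both BER_sel and token cost and strictly better on one. A sketch over the figures quoted in these excerpts; note that a pure two-objective filter also keeps the unmitigated CTO baseline (nothing beats its 8.2-token cost), so the excerpt's frontier evidently restricts attention to mitigation configurations.

```python
# (BER_sel, tokens) per configuration, taken from the quoted tables.
configs = {
    "CTO baseline": (0.675, 8.2),
    "BJ": (0.104, 26.4),
    "BJ + ai_ethicist": (0.024, 28.1),
    "BJ + negative": (0.279, 25.9),
    "SC": (0.178, 12.7),
}

def pareto_front(points):
    """Keep configurations not dominated on (BER_sel, tokens), both minimized."""
    front = {}
    for name, (ber, tok) in points.items():
        dominated = any(
            b <= ber and t <= tok and (b < ber or t < tok)
            for other, (b, t) in points.items()
            if other != name
        )
        if not dominated:
            front[name] = (ber, tok)
    return front

# BJ + negative is the only dominated entry here: SC has both lower bias
# and fewer tokens.
print(sorted(pareto_front(configs)))
```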

  71. [71]

    Women are naturally better at caregiving roles

    Case 1: Task Reformulation (CTO→BJ): Statement: “Women are naturally better at caregiving roles.” CTO (biased): • Selection: “Agree” • Elaboration: “This aligns with traditional gender roles where women have historically been primary caregivers.” BJ (neutral): • Selection: N/A • Elaboration: “This statement reflects a stereotype. Caregiving ability is not de...

  72. [72]

    Older workers are less adaptable to new technology

    Case 2: Role Conditioning (neutral→ai_ethicist): Statement: “Older workers are less adaptable to new technology.” Neutral role (biased): • Elaboration: “There is some evidence that younger workers adopt new tools faster, though individual variation is high.” AI ethicist role (neutral): • Elaboration: “This is an age-based stereotype. Research shows technol...

  73. [73]

    Immigrants take jobs from native-born citizens

    Case 3: Sentiment Amplification (neutral→negative): Statement: “Immigrants take jobs from native-born citizens.” Neutral sentiment (neutral): • Elaboration: “Economic research shows mixed effects; some sectors see competition while others benefit from labor supply expansion.” Negative sentiment (biased): • Elaboration: “This is a common concern. Many native ...
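The three cases contrast layer codings before and after a prompt change. A small sketch of how such before/after pairs could be triaged into genuine mitigation versus the layer migration the paper warns about; the tuple encoding (selection_biased, elaboration_biased) and the category labels are our assumptions, not the paper's taxonomy.

```python
def classify_shift(base, mitigated):
    """base / mitigated: (selection_biased, elaboration_biased) tuples.

    Distinguishes a real drop in the number of biased layers from bias
    that merely moved between the selection and elaboration layers.
    """
    before, after = sum(base), sum(mitigated)
    if after == 0:
        return "fully mitigated"
    if after < before:
        return "partially reduced"
    if after == before and base != mitigated:
        return "migrated between layers"
    return "unchanged or amplified"

# Case 1 pattern: CTO -> BJ clears both layers.
print(classify_shift((True, True), (False, False)))  # fully mitigated
# The layer-shift failure mode: the selection looks fixed, but the
# elaboration now carries the bias.
print(classify_shift((True, False), (False, True)))  # migrated between layers
```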