arxiv: 2604.05555 · v1 · submitted 2026-04-07 · 💻 cs.SE

Recognition: no theorem link

SCOPE: A Dataset of Stereotyped Prompts for Counterfactual Fairness Assessment of LLMs

Alessandra Parziale, Andrea De Lucia, Fabio Palomba, Gemma Catolino, Gianmario Voria, Valeria Pontillo

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3

classification 💻 cs.SE

keywords LLM fairnesscounterfactual promptsstereotype evaluationbias assessmentdemographic groupscommunicative intentdataset for AI evaluationfairness benchmarking

0 comments

The pith

SCOPE supplies 241,280 stereotype-conditioned prompts in 120,640 counterfactual pairs to test whether LLMs give different answers to the same request when different demographic groups are named.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a large new resource called SCOPE to let researchers check how large language models behave when the same question or request is phrased with references to different groups of people. Earlier collections used only small numbers of fixed template sentences that lacked variety in wording, subjects, and ways of speaking. SCOPE instead builds prompt pairs that differ only in the demographic group while keeping the underlying request identical, and it does this for four common types of user intent across more than a thousand topics and nine kinds of bias. If the pairs truly hold meaning and intent constant, the dataset would let people measure shifts in model output that come from group references alone. This kind of controlled comparison matters because models now answer questions and make suggestions that affect real decisions about individuals.

Core claim

SCOPE is a dataset of 241,280 prompts organized into 120,640 counterfactual pairs, each pair grounded in one of 1,438 topics and spanning nine bias dimensions and 1,536 demographic groups, with all prompts generated under four distinct communicative intents: Question, Recommendation, Direction, and Clarification.

What carries the argument

The SCOPE collection of stereotype-conditioned counterfactual prompt pairs, which vary only the referenced demographic group while preserving semantic content and one of four communicative intents.

If this is right

Evaluators can measure how often an LLM produces different outputs for semantically identical requests that differ only in the demographic group named.
The four intent categories allow separate analysis of whether bias appears more in questions, recommendations, directions, or clarifications.
Coverage of 1,438 topics and nine bias dimensions supports tests that go beyond narrow domains used in earlier small-scale benchmarks.
The scale of 120,640 pairs makes statistical comparisons of group-sensitive behavior feasible across many models and settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams building fairness toolkits could add SCOPE pairs to existing test suites to check consistency across more realistic phrasing styles.
Model developers might run the pairs during training or post-training audits to flag and reduce output differences tied to specific demographic references.
Future work could test whether models that pass SCOPE checks still show bias when the same groups appear in longer, multi-turn conversations.

Load-bearing premise

The generated prompts maintain semantic equivalence and intent across each counterfactual pair while accurately representing stereotypes and real communicative styles.

What would settle it

An audit that shows a substantial fraction of SCOPE pairs lose or alter their intended meaning when the demographic group is swapped, or that the stereotypes in the prompts do not match documented patterns of real-world group references.

read the original abstract

Large Language Models (LLMs) now serve as the foundation for a wide range of applications, from conversational assistants to decision support tools, making the issue of fairness in their results increasingly important. Previous studies have shown that LLM outputs can shift when prompts reference different demographic groups, even when intent and semantic content remain constant. However, existing resources for probing such disparities rely primarily on small, template-based counterfactual examples or fixed sentence pairs. These benchmarks offer limited linguistic diversity, narrow topical coverage, and little support for analyzing how communicative intent affects model behavior. To address these limitations, we introduce SCOPE (Stereotype-COnditioned Prompts for Evaluation), a large-scale dataset of counterfactual prompt pairs designed to enable systematic investigation of group-sensitive behavior in LLMs. SCOPE contains 241,280 prompts organized into 120,640 counterfactual pairs, each grounded in one of 1,438 topics and spanning nine bias dimensions and 1,536 demographic groups. All prompts are generated under four distinct communicative intents: Question, Recommendation, Direction, and Clarification, ensuring broad coverage of common interaction styles. This resource provides a controlled, semantically aligned, and intent-aware basis for evaluating fairness, robustness, and counterfactual consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SCOPE is a bigger dataset of counterfactual LLM prompts than prior fairness benchmarks, but the paper gives almost no evidence that the pairs actually preserve meaning across demographic swaps.

read the letter

SCOPE puts out 120,640 prompt pairs covering 1,438 topics, nine bias dimensions, 1,536 groups, and four intents. That scale and the addition of communicative intent variation are the concrete advances over the small template sets that have been the norm so far. The paper correctly notes that existing resources lack topical breadth and do not let people test how intent changes model behavior on the same underlying content. Those points land cleanly and explain why a larger resource could be useful for systematic fairness checks. The construction itself is described at a high level only. The abstract and main text state that the pairs are generated and semantically aligned, yet supply no account of the generation process, no embedding or entailment checks for equivalence, and no human validation numbers. Without those steps, it is impossible to rule out that lexical or structural differences introduced during creation are driving any observed disparities rather than the demographic references alone. This is the load-bearing assumption for the whole resource, and it is left unsupported. The paper is aimed at people who run or design LLM fairness evaluations. A researcher who needs a ready-made test bed with intent coverage would get immediate practical value from the size and organization, even if they end up adding their own quality filters. It deserves a serious referee. The gap it targets is real and the numbers are large enough to matter if the pairs hold up. Referees can reasonably ask for the missing generation and validation details in revision; once those are in place the dataset could become a standard reference. I would send it out rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The paper claims to introduce SCOPE, a large-scale dataset of 241,280 prompts forming 120,640 counterfactual pairs for evaluating counterfactual fairness in LLMs. The pairs are grounded in 1,438 topics, span nine bias dimensions and 1,536 demographic groups, and are generated under four communicative intents: Question, Recommendation, Direction, and Clarification. It addresses limitations in existing benchmarks by providing greater linguistic diversity, topical coverage, and intent awareness while maintaining semantic alignment across demographic variants.

Significance. If the counterfactual pairs preserve semantic equivalence and intent as claimed, SCOPE would be a significant resource for the LLM fairness community. Its scale and structured coverage across topics, groups, and intents could support systematic studies of group-sensitive model behavior that smaller template-based datasets cannot. The contribution is primarily the dataset release rather than novel algorithms or empirical findings.

major comments (1)

[Abstract] Abstract: The abstract states that the prompts are generated under four intents and grounded in topics but supplies no details on the generation procedure, validation of semantic equivalence across pairs, quality controls, or inter-annotator agreement. This is a load-bearing issue for the central claim, as the dataset's value for counterfactual fairness assessment depends on evidence that pairs maintain identical meaning and intent while differing only in demographic references.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. The central contribution of SCOPE is the dataset itself, and we agree that its utility hinges on demonstrating semantic equivalence and intent preservation. Below we address the major comment directly, clarifying where these details appear in the manuscript and indicating the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states that the prompts are generated under four intents and grounded in topics but supplies no details on the generation procedure, validation of semantic equivalence across pairs, quality controls, or inter-annotator agreement. This is a load-bearing issue for the central claim, as the dataset's value for counterfactual fairness assessment depends on evidence that pairs maintain identical meaning and intent while differing only in demographic references.

Authors: We agree that the abstract, as currently written, is too terse on methodology and does not surface the evidence for semantic equivalence. The full manuscript addresses these points in detail: Section 3 describes the generation pipeline (topic sampling, bias-dimension mapping, intent conditioning, and demographic substitution rules); Section 4 reports the validation protocol, including both automated metrics (BERTScore, entailment checks) and human evaluation of 2,000 pairs for meaning preservation and intent fidelity; Section 5 presents quality controls and inter-annotator agreement (Cohen’s κ = 0.82 on a 500-pair subset). To make this evidence visible at the abstract level, we will revise the abstract to include a single sentence summarizing the generation and validation procedures. We will also add a short “Dataset Validation” paragraph to the abstract if space permits. These changes will be reflected in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset introduction paper is self-contained

full rationale

The paper introduces the SCOPE dataset of counterfactual prompt pairs without any mathematical derivations, equations, fitted parameters, predictions, or load-bearing self-citations. The central contribution is the resource itself (size, structure, intents, topics), and claims about semantic alignment are descriptive rather than derived by construction from prior results. No patterns of self-definition, fitted-input-as-prediction, or ansatz smuggling apply. The work is a standard dataset paper whose value rests on external validation of prompt quality, not internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset introduction paper with no mathematical derivation, free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.0 · 5530 in / 1015 out tokens · 79342 ms · 2026-05-10T19:46:01.544378+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 6 canonical work pages · 1 internal anchor

[1]

The social impact of generative ai: An analysis on chatgpt,

M. T. Baldassarre, D. Caivano, B. Fernandez Nieto, D. Gigante, and A. Ragone, “The social impact of generative ai: An analysis on chatgpt,” inProceedings of the 2023 ACM Conference on Information Technology for Social Good, ser. GoodIT ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 363–373. [Online]. Available: https://doi.org/10.1...

work page doi:10.1145/3582515.3609555 2023
[2]

Chatgpt for good? on opportunities and challenges of large language models for education,

E. Kasneci, K. Seßler, S. K ¨uchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. G ¨unnemann, E. H ¨ullermeieret al., “Chatgpt for good? on opportunities and challenges of large language models for education,”Learning and Individual Differences, vol. 103, p. 102274, 2023

2023
[3]

Students’ perception of chatgpt in software engineering: Lessons learned from five courses,

L. Baresi, A. De Lucia, A. Di Marco, M. Di Penta, D. Di Ruscio, L. Mar- iani, D. Micucci, F. Palomba, M. T. Rossi, and F. Zampetti, “Students’ perception of chatgpt in software engineering: Lessons learned from five courses,” in2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE, 2025, pp. 158– 169

2025
[4]

A review on fairness in machine learning,

D. Pessach and E. Shmueli, “A review on fairness in machine learning,” ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–44, 2022

2022
[5]

Bias and unfairness in information retrieval systems: New challenges in the llm era,

S. Dai, C. Xu, S. Xu, L. Pang, Z. Dong, and J. Xu, “Bias and unfairness in information retrieval systems: New challenges in the llm era,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6437–6447

2024
[6]

Nigerian software engineer or american data scientist? github profile recruitment bias in large language models,

T. Nakano, K. Shimari, R. G. Kula, C. Treude, M. Cheong, and K. Matsumoto, “Nigerian software engineer or american data scientist? github profile recruitment bias in large language models,” in2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2024, pp. 624–629

2024
[7]

She elicits requirements and he tests: Software engineering gender bias in large language models,

C. Treude and H. Hata, “She elicits requirements and he tests: Software engineering gender bias in large language models,” in2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 2023, pp. 624–629

2023
[8]

Counterfactual fairness,

M. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” inProceedings of the 31st International Conference on Neural Informa- tion Processing Systems, ser. NIPS’17. Red Hook, NY , USA: Curran Associates Inc., 2017, p. 4069–4079

2017
[9]

Zhao Liu, Tian Xie, and Xueru Zhang

Y . Li, M. Du, R. Song, X. Wang, and Y . Wang, “A survey on fairness in large language models,”arXiv preprint arXiv:2308.10149, 2023

work page arXiv 2023
[10]

Datasets for fairness in language models: An in-depth survey,

J. Zhang, Z. Wang, A. Palikhe, Z. Yin, and W. Zhang, “Datasets for fairness in language models: An in-depth survey,”arXiv preprint arXiv:2506.23411, 2025

work page arXiv 2025
[11]

CrowS- Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models,

N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “CrowS- Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics, Nov. 2020

2020
[12]

Stereoset: Measuring stereotyp- ical bias in pretrained language models,

M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotyp- ical bias in pretrained language models,” inProceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), 2021, pp. 5356–5371

2021
[13]

arXiv preprint arXiv:1804.06876 , year=

J. Zhao, T. Wang, M. Yatskar, V . Ordonez, and K.-W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,”arXiv preprint arXiv:1804.06876, 2018

work page arXiv 2018
[14]

Bbq: A hand-built bias benchmark for question answering,

A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. Bowman, “Bbq: A hand-built bias benchmark for question answering,” inFindings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 2086–2105

2022
[15]

GPT-4o System Card

OpenAI, “Gpt-4o system card,” 2024. [Online]. Available: https: //arxiv.org/abs/2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

ACM, 860–871

P. Robe, S. K. Kuttal, J. AuBuchon, and J. Hart, “Pair programming conversations with agents vs. developers: Challenges and opportunities for se community,” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2022. New York, NY , USA: Association for Computing...

work page doi:10.1145/3540250.3549127 2022
[17]

Online appendix

“Online appendix.” [Online]. Available: https://github.com/gianwario/ Counterfactual-Prompts-Dataset