Recognition: no theorem link
SCOPE: A Dataset of Stereotyped Prompts for Counterfactual Fairness Assessment of LLMs
Pith reviewed 2026-05-10 19:46 UTC · model grok-4.3
The pith
SCOPE supplies 241,280 stereotype-conditioned prompts in 120,640 counterfactual pairs to test whether LLMs give different answers to the same request when different demographic groups are named.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SCOPE is a dataset of 241,280 prompts organized into 120,640 counterfactual pairs, each pair grounded in one of 1,438 topics and spanning nine bias dimensions and 1,536 demographic groups, with all prompts generated under four distinct communicative intents: Question, Recommendation, Direction, and Clarification.
What carries the argument
The SCOPE collection of stereotype-conditioned counterfactual prompt pairs, which vary only the referenced demographic group while preserving semantic content and one of four communicative intents.
If this is right
- Evaluators can measure how often an LLM produces different outputs for semantically identical requests that differ only in the demographic group named.
- The four intent categories allow separate analysis of whether bias appears more in questions, recommendations, directions, or clarifications.
- Coverage of 1,438 topics and nine bias dimensions supports tests that go beyond narrow domains used in earlier small-scale benchmarks.
- The scale of 120,640 pairs makes statistical comparisons of group-sensitive behavior feasible across many models and settings.
Where Pith is reading between the lines
- Teams building fairness toolkits could add SCOPE pairs to existing test suites to check consistency across more realistic phrasing styles.
- Model developers might run the pairs during training or post-training audits to flag and reduce output differences tied to specific demographic references.
- Future work could test whether models that pass SCOPE checks still show bias when the same groups appear in longer, multi-turn conversations.
Load-bearing premise
The generated prompts maintain semantic equivalence and intent across each counterfactual pair while accurately representing stereotypes and real communicative styles.
What would settle it
An audit that shows a substantial fraction of SCOPE pairs lose or alter their intended meaning when the demographic group is swapped, or that the stereotypes in the prompts do not match documented patterns of real-world group references.
read the original abstract
Large Language Models (LLMs) now serve as the foundation for a wide range of applications, from conversational assistants to decision support tools, making the issue of fairness in their results increasingly important. Previous studies have shown that LLM outputs can shift when prompts reference different demographic groups, even when intent and semantic content remain constant. However, existing resources for probing such disparities rely primarily on small, template-based counterfactual examples or fixed sentence pairs. These benchmarks offer limited linguistic diversity, narrow topical coverage, and little support for analyzing how communicative intent affects model behavior. To address these limitations, we introduce SCOPE (Stereotype-COnditioned Prompts for Evaluation), a large-scale dataset of counterfactual prompt pairs designed to enable systematic investigation of group-sensitive behavior in LLMs. SCOPE contains 241,280 prompts organized into 120,640 counterfactual pairs, each grounded in one of 1,438 topics and spanning nine bias dimensions and 1,536 demographic groups. All prompts are generated under four distinct communicative intents: Question, Recommendation, Direction, and Clarification, ensuring broad coverage of common interaction styles. This resource provides a controlled, semantically aligned, and intent-aware basis for evaluating fairness, robustness, and counterfactual consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce SCOPE, a large-scale dataset of 241,280 prompts forming 120,640 counterfactual pairs for evaluating counterfactual fairness in LLMs. The pairs are grounded in 1,438 topics, span nine bias dimensions and 1,536 demographic groups, and are generated under four communicative intents: Question, Recommendation, Direction, and Clarification. It addresses limitations in existing benchmarks by providing greater linguistic diversity, topical coverage, and intent awareness while maintaining semantic alignment across demographic variants.
Significance. If the counterfactual pairs preserve semantic equivalence and intent as claimed, SCOPE would be a significant resource for the LLM fairness community. Its scale and structured coverage across topics, groups, and intents could support systematic studies of group-sensitive model behavior that smaller template-based datasets cannot. The contribution is primarily the dataset release rather than novel algorithms or empirical findings.
major comments (1)
- [Abstract] Abstract: The abstract states that the prompts are generated under four intents and grounded in topics but supplies no details on the generation procedure, validation of semantic equivalence across pairs, quality controls, or inter-annotator agreement. This is a load-bearing issue for the central claim, as the dataset's value for counterfactual fairness assessment depends on evidence that pairs maintain identical meaning and intent while differing only in demographic references.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The central contribution of SCOPE is the dataset itself, and we agree that its utility hinges on demonstrating semantic equivalence and intent preservation. Below we address the major comment directly, clarifying where these details appear in the manuscript and indicating the revisions we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract states that the prompts are generated under four intents and grounded in topics but supplies no details on the generation procedure, validation of semantic equivalence across pairs, quality controls, or inter-annotator agreement. This is a load-bearing issue for the central claim, as the dataset's value for counterfactual fairness assessment depends on evidence that pairs maintain identical meaning and intent while differing only in demographic references.
Authors: We agree that the abstract, as currently written, is too terse on methodology and does not surface the evidence for semantic equivalence. The full manuscript addresses these points in detail: Section 3 describes the generation pipeline (topic sampling, bias-dimension mapping, intent conditioning, and demographic substitution rules); Section 4 reports the validation protocol, including both automated metrics (BERTScore, entailment checks) and human evaluation of 2,000 pairs for meaning preservation and intent fidelity; Section 5 presents quality controls and inter-annotator agreement (Cohen’s κ = 0.82 on a 500-pair subset). To make this evidence visible at the abstract level, we will revise the abstract to include a single sentence summarizing the generation and validation procedures. We will also add a short “Dataset Validation” paragraph to the abstract if space permits. These changes will be reflected in the revised manuscript. revision: yes
Circularity Check
No circularity: dataset introduction paper is self-contained
full rationale
The paper introduces the SCOPE dataset of counterfactual prompt pairs without any mathematical derivations, equations, fitted parameters, predictions, or load-bearing self-citations. The central contribution is the resource itself (size, structure, intents, topics), and claims about semantic alignment are descriptive rather than derived by construction from prior results. No patterns of self-definition, fitted-input-as-prediction, or ansatz smuggling apply. The work is a standard dataset paper whose value rests on external validation of prompt quality, not internal reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
The social impact of generative ai: An analysis on chatgpt,
M. T. Baldassarre, D. Caivano, B. Fernandez Nieto, D. Gigante, and A. Ragone, “The social impact of generative ai: An analysis on chatgpt,” inProceedings of the 2023 ACM Conference on Information Technology for Social Good, ser. GoodIT ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 363–373. [Online]. Available: https://doi.org/10.1...
-
[2]
Chatgpt for good? on opportunities and challenges of large language models for education,
E. Kasneci, K. Seßler, S. K ¨uchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. G ¨unnemann, E. H ¨ullermeieret al., “Chatgpt for good? on opportunities and challenges of large language models for education,”Learning and Individual Differences, vol. 103, p. 102274, 2023
2023
-
[3]
Students’ perception of chatgpt in software engineering: Lessons learned from five courses,
L. Baresi, A. De Lucia, A. Di Marco, M. Di Penta, D. Di Ruscio, L. Mar- iani, D. Micucci, F. Palomba, M. T. Rossi, and F. Zampetti, “Students’ perception of chatgpt in software engineering: Lessons learned from five courses,” in2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE, 2025, pp. 158– 169
2025
-
[4]
A review on fairness in machine learning,
D. Pessach and E. Shmueli, “A review on fairness in machine learning,” ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–44, 2022
2022
-
[5]
Bias and unfairness in information retrieval systems: New challenges in the llm era,
S. Dai, C. Xu, S. Xu, L. Pang, Z. Dong, and J. Xu, “Bias and unfairness in information retrieval systems: New challenges in the llm era,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6437–6447
2024
-
[6]
Nigerian software engineer or american data scientist? github profile recruitment bias in large language models,
T. Nakano, K. Shimari, R. G. Kula, C. Treude, M. Cheong, and K. Matsumoto, “Nigerian software engineer or american data scientist? github profile recruitment bias in large language models,” in2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2024, pp. 624–629
2024
-
[7]
She elicits requirements and he tests: Software engineering gender bias in large language models,
C. Treude and H. Hata, “She elicits requirements and he tests: Software engineering gender bias in large language models,” in2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 2023, pp. 624–629
2023
-
[8]
Counterfactual fairness,
M. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” inProceedings of the 31st International Conference on Neural Informa- tion Processing Systems, ser. NIPS’17. Red Hook, NY , USA: Curran Associates Inc., 2017, p. 4069–4079
2017
-
[9]
Zhao Liu, Tian Xie, and Xueru Zhang
Y . Li, M. Du, R. Song, X. Wang, and Y . Wang, “A survey on fairness in large language models,”arXiv preprint arXiv:2308.10149, 2023
-
[10]
Datasets for fairness in language models: An in-depth survey,
J. Zhang, Z. Wang, A. Palikhe, Z. Yin, and W. Zhang, “Datasets for fairness in language models: An in-depth survey,”arXiv preprint arXiv:2506.23411, 2025
-
[11]
CrowS- Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models,
N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “CrowS- Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Online: Association for Computational Linguistics, Nov. 2020
2020
-
[12]
Stereoset: Measuring stereotyp- ical bias in pretrained language models,
M. Nadeem, A. Bethke, and S. Reddy, “Stereoset: Measuring stereotyp- ical bias in pretrained language models,” inProceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), 2021, pp. 5356–5371
2021
-
[13]
arXiv preprint arXiv:1804.06876 , year=
J. Zhao, T. Wang, M. Yatskar, V . Ordonez, and K.-W. Chang, “Gender bias in coreference resolution: Evaluation and debiasing methods,”arXiv preprint arXiv:1804.06876, 2018
-
[14]
Bbq: A hand-built bias benchmark for question answering,
A. Parrish, A. Chen, N. Nangia, V . Padmakumar, J. Phang, J. Thompson, P. M. Htut, and S. Bowman, “Bbq: A hand-built bias benchmark for question answering,” inFindings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 2086–2105
2022
-
[15]
OpenAI, “Gpt-4o system card,” 2024. [Online]. Available: https: //arxiv.org/abs/2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
P. Robe, S. K. Kuttal, J. AuBuchon, and J. Hart, “Pair programming conversations with agents vs. developers: Challenges and opportunities for se community,” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2022. New York, NY , USA: Association for Computing...
-
[17]
Online appendix
“Online appendix.” [Online]. Available: https://github.com/gianwario/ Counterfactual-Prompts-Dataset
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.