Making AI Evaluation Deployment Relevant Through Context Specification

Matthew Holmes; Reva Schwartz; Thiago Lacerda

arxiv: 2603.06811 · v3 · submitted 2026-03-06 · 💻 cs.AI


Pith reviewed 2026-05-15 14:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI evaluation · context specification · deployment success · stakeholder perspectives · operational realities · AI deployment · evaluation process · decision making


The pith

Context specification converts vague stakeholder views into explicit, measurable constructs that guide AI evaluations to match real deployment conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI evaluations often overlook the specific operational realities that decide whether deployments succeed, leaving decision makers without clear guidance on value. The paper introduces context specification as the remedy: a process that takes diffuse stakeholder perspectives and turns them into named, observable definitions of properties, behaviors, and outcomes. These definitions become the targets that evaluations can actually measure within the contexts organizations manage. If the approach works, evaluations stop masking deployment risks and instead provide a direct roadmap for what AI systems will do once live.

Core claim

Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.

What carries the argument

Context specification, the process of translating stakeholder perspectives about a deployment setting into explicit, named definitions of measurable properties, behaviors, and outcomes.
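As a minimal sketch of what a named construct might look like once written down, the following Python records one explicit definition per construct and flags those that still lack an observable. This is an illustration under stated assumptions, not the authors' method: the class names `Construct` and `ContextSpecification`, the three-value `kind` vocabulary, and the hospital example are all hypothetical, and the paper itself supplies no template.

```python
from dataclasses import dataclass, field

# Hypothetical vocabulary, taken from the paper's phrase
# "properties, behaviors, and outcomes".
ALLOWED_KINDS = {"property", "behavior", "outcome"}


@dataclass(frozen=True)
class Construct:
    """One named construct: an explicit definition plus candidate observables."""
    name: str
    kind: str
    definition: str
    observables: tuple[str, ...] = ()

    def is_measurable(self) -> bool:
        # Assumption: a construct counts as measurable only once
        # at least one observable has been attached to it.
        return len(self.observables) > 0


@dataclass
class ContextSpecification:
    """A deployment setting together with the constructs defined for it."""
    setting: str
    constructs: list[Construct] = field(default_factory=list)

    def add(self, construct: Construct) -> None:
        if construct.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown kind: {construct.kind}")
        self.constructs.append(construct)

    def unmeasured(self) -> list[str]:
        # Names of constructs that evaluations cannot yet observe.
        return [c.name for c in self.constructs if not c.is_measurable()]


# Hypothetical example setting and constructs, invented for illustration.
spec = ContextSpecification(setting="hospital discharge-summary drafting")
spec.add(Construct(
    name="clinician_edit_burden",
    kind="outcome",
    definition="Time clinicians spend correcting AI-drafted summaries.",
    observables=("minutes_edited_per_summary",),
))
spec.add(Construct(
    name="handoff_trust",
    kind="property",
    definition="Whether receiving staff rely on the summary without rechecking.",
))

print(spec.unmeasured())
```

Keeping the definition and its observables in one record makes the gap visible: a construct that stakeholders have named but nobody can yet observe shows up in `unmeasured()` rather than silently dropping out of the evaluation.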

If this is right

Evaluations become tied to the specific contexts organizations manage rather than generic benchmarks.
Decision makers can directly assess whether an AI tool will deliver durable value before full deployment.
Operational realities that currently remain hidden gain explicit definitions that evaluations must address.
The same stakeholder inputs produce consistent, observable targets across repeated evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same translation step could be applied to non-AI systems to surface hidden deployment risks.
Documented context constructs might serve as reusable templates when similar AI tools move to new organizations.
Long-term tracking of whether specified outcomes actually occur could reveal which constructs are most predictive.

Load-bearing premise

Stakeholder perspectives can be reliably translated into explicit, measurable constructs that accurately reflect the operational realities determining deployment success.

What would settle it

A side-by-side trial in which one group uses context specification to define evaluation targets and another uses standard methods; the group with specified constructs shows measurably higher accuracy when its predictions of live deployment outcomes are compared against actual results.
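The comparison described above could be scored along the following lines. This is a hedged sketch only: the function name `prediction_accuracy`, the boolean encoding of outcomes, and every value in the lists are made up for illustration and are not results from the paper, which reports no such trial.

```python
def prediction_accuracy(predicted: list[bool], actual: list[bool]) -> float:
    """Fraction of deployment outcomes that were predicted correctly."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Illustrative placeholder data, NOT empirical results.
# Each entry answers "did the specified outcome occur after go-live?"
actual_outcomes = [True, False, True, True, False]
specified_arm = [True, False, True, False, False]   # used context specification
standard_arm = [True, True, False, False, False]    # used standard methods

acc_specified = prediction_accuracy(specified_arm, actual_outcomes)
acc_standard = prediction_accuracy(standard_arm, actual_outcomes)

print(f"specified arm: {acc_specified:.2f}")
print(f"standard arm:  {acc_standard:.2f}")
print(f"difference:    {acc_specified - acc_standard:.2f}")
```

A real trial would also need enough deployments per arm and a significance test before the difference could be read as support for the premise; the sketch only shows the quantity being compared.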

Figures

Figures reproduced from arXiv: 2603.06811 by Matthew Holmes, Reva Schwartz, Thiago Lacerda.

**Figure 1.** Unlike participatory methods that focus on what users want a technology to do or how it should be designed, systematic context specification uses stakeholder input to define and structure the deployment-relevant concepts that evaluations must measure. It clarifies what matters for an AI deployment in a particular setting so that assessments of utility, risk, and safety are tied to realistic operational mat…

**Figure 1.** Context specification serves as the "Contextualize" step in the CIRCLE

**Figure 2.** Context specification as the deployment-to-evaluation translation

Original abstract

With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches often mask the operational realities that ultimately determine deployment success, making it difficult for organizational decision makers to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform this decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper names context specification as a process to turn stakeholder views into measurable evaluation constructs for AI deployments, but stays at a high-level description without steps or evidence.


The main takeaway is that this paper introduces context specification as a foundational step for making AI evaluations match real deployment conditions. It argues that turning diffuse stakeholder perspectives into explicit, named properties and outcomes will help organizations avoid mismatched evaluations and make better decisions about AI tools. The framing is direct and points to a genuine practical problem with generic status-quo evaluations. That part is useful as a reminder of why context matters.

The paper does not deliver much beyond the naming and high-level description. It offers no worked example, no sequence of steps for eliciting and reconciling stakeholder input, and no way to check whether the resulting constructs actually predict deployment success. The central assumption that perspectives can be reliably translated into observables is stated but left unaddressed. There is also no discussion of how this process would integrate with existing evaluation methods or requirements engineering work.

This is aimed at practitioners and researchers who build or oversee organizational AI evaluation programs. A reader already thinking about context-aware assessment might pick up the terminology and use it in conversation, but the lack of operational detail limits its value for anyone needing to apply or test the idea. I would not send it for peer review in this form; it reads more like a short position piece that would benefit from a concrete methodology section and at least one illustrative case before going further.

Referee Report

2 major / 0 minor

Summary. The paper claims that status-quo AI evaluation methods mask the operational realities that determine deployment success, and introduces context specification as a process that converts diffuse stakeholder perspectives into explicit, named constructs (definitions of properties, behaviors, and outcomes) that can be observed and measured to inform organizational AI deployment decisions.

Significance. If the proposed process can be operationalized with repeatable steps and validation, it could meaningfully improve the alignment between AI evaluations and real-world deployment contexts, helping organizations make better-informed decisions about whether and how AI tools deliver durable value.

major comments (2)

[Abstract / Process Description] Abstract and process description: the central claim that context specification reliably turns stakeholder perspectives into observable, measurable constructs that predict deployment success rests on an unelaborated translation step; the manuscript supplies only a high-level description with no elicitation method, reconciliation criteria, formalization template, or worked example.
[Abstract] No empirical data, case study, or validation procedure is presented to demonstrate that the resulting constructs improve decision outcomes or evaluation relevance, leaving the practical utility of the proposal untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below, providing clarifications on the paper's conceptual scope and outlining revisions to strengthen the presentation of context specification.


Referee: [Abstract / Process Description] Abstract and process description: the central claim that context specification reliably turns stakeholder perspectives into observable, measurable constructs that predict deployment success rests on an unelaborated translation step; the manuscript supplies only a high-level description with no elicitation method, reconciliation criteria, formalization template, or worked example.

Authors: The manuscript introduces context specification as a high-level conceptual process and foundational roadmap for aligning evaluations with deployment contexts, without claiming a fully operationalized methodology. We agree that the translation from stakeholder perspectives to named constructs would benefit from greater elaboration. In the revised manuscript, we will expand the process description section to include a detailed worked example, specific elicitation steps, reconciliation criteria for conflicting perspectives, and a formalization template. This addition will make the steps more concrete while maintaining the paper's focus on the overall approach rather than exhaustive implementation details.

revision: yes

Referee: [Abstract] No empirical data, case study, or validation procedure is presented to demonstrate that the resulting constructs improve decision outcomes or evaluation relevance, leaving the practical utility of the proposal untested.

Authors: The paper is explicitly positioned as a conceptual contribution that defines context specification and motivates its role in improving AI evaluation relevance, without presenting or claiming empirical validation. We do not assert that the process has been tested for predictive power or decision improvement in this work. To address the concern, we will revise the abstract, introduction, and discussion sections to explicitly clarify the conceptual nature of the contribution and identify empirical validation and case studies as important directions for future research. This will prevent any overstatement of the current claims.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; conceptual process description without equations, fits, or self-referential derivations.


The paper presents context specification as a high-level process for turning stakeholder perspectives into named constructs for AI evaluation. No mathematical derivations, equations, or parameter-fitting steps exist. The central claim is a definitional proposal rather than a result derived from prior inputs by construction. No self-citations are load-bearing for any uniqueness theorem or ansatz. The translation from perspectives to observables is stated as an assumption without reduction to fitted data or self-referential logic. This is a standard non-circular outcome for a descriptive framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that stakeholder perspectives can be translated into explicit measurable constructs that capture operational realities, with no free parameters or invented physical entities.

axioms (1)

domain assumption: Stakeholder perspectives about what matters can be converted into explicit, named, and measurable constructs without significant loss of relevant information.
This assumption underpins the entire context specification process described in the abstract.

invented entities (1)

context specification process (no independent evidence)
purpose: To convert diffuse stakeholder perspectives into clear constructs for AI evaluation
Newly proposed process introduced to address limitations of status quo evaluations.

pith-pipeline@v0.9.0 · 5412 in / 1250 out tokens · 44111 ms · 2026-05-15T14:43:32.264539+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

The process yields a structured set of outputs... Named stakeholder priorities... Evaluable constructs... Linking mechanisms... Candidate observables...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

State of ai in business 2025,

MLQ AI, “State of ai in business 2025,” 2025, version 0.1. [Online]. Available: https://mlq.ai/media/quarterly decks/v0.1 State of AI in Business 2025 Report.pdf

work page 2025

[2] [2]

Sociotechnical transformation: A systematic review on the impact of artificial intelligence on society and organizations,

M. Selgas-Cors, “Sociotechnical transformation: A systematic review on the impact of artificial intelligence on society and organizations,” FinTech and Sustainable Innovation, vol. 00, no. 00, pp. 1–16, 2025. [Online]. Available: https://ojs.bonviewpress.com/index.php/FSI/article/ view/6076/1547

work page 2025

[3] [3]

Measuring what Matters: Construct Validity in Large Language Model Benchmarks,

A. M. Bean, R. O. Kearns, A. Romanou, F. S. Hafner, H. Mayne, J. Batzner, N. Foroutan, C. Schmitz, K. Korgul, H. Batra, O. Deb, E. Beharry, C. Emde, T. Foster, A. Gausen, M. Grandury, S. Han, V . Hofmann, L. Ibrahim, H. Kim, H. R. Kirk, F. Lin, G. K.-M. Liu, L. Luettgau, J. Magomere, J. Rystrøm, A. Sotnikova, Y . Yang, Y . Zhao, A. Bibi, A. Bosselut, R. C...

work page 2025

[4] [4]

ACM Transactions on Computer-Human Interaction, 27(5)

H. Wallach, M. Desai, A. F. Cooper, A. Wang, C. Atalla, S. Barocas, S. L. Blodgett, A. Chouldechova, E. Corvi, P. A. Dow, J. Garcia- Gathright, A. Olteanu, N. Pangakis, S. Reed, E. Sheng, D. Vann, J. W. Vaughan, M. V ogel, H. Washington, and A. Z. Jacobs, “Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge,” Jun. 2025, ar...

work page arXiv 2025

[5] [5]

Measurement to Meaning: A Validity-Centered Framework for AI Evaluation,

O. Salaudeen, A. Reuel, A. Ahmed, S. Bedi, Z. Robertson, S. Sundar, B. Domingue, A. Wang, and S. Koyejo, “Measurement to Meaning: A Validity-Centered Framework for AI Evaluation,” Jun. 2025, arXiv:2505.10573 [cs]. [Online]. Available: http://arxiv.org/abs/2505. 10573 7

work page arXiv 2025

[6] [6]

evaluating student performance

L. Weidinger, I. D. Raji, H. Wallach, M. Mitchell, A. Wang, O. Salaudeen, R. Bommasani, D. Ganguli, S. Koyejo, and W. Isaac, “Toward an evaluation science for generative ai systems,” 2025. [Online]. Available: https://arxiv.org/abs/2503.05336

work page arXiv 2025

[7] [7]

A shared standard for valid measurement of generative ai systems’ capabilities, risks, and impacts.arXiv preprint arXiv:2412.01934,

A. Chouldechova, C. Atalla, S. Barocas, A. F. Cooper, E. Corvi, P. A. Dow, J. Garcia-Gathright, N. Pangakis, S. Reed, E. Sheng, D. Vann, M. V ogel, H. Washington, and H. Wallach, “A Shared Standard for Valid Measurement of Generative AI Systems’ Capabilities, Risks, and Impacts,” Dec. 2024, arXiv:2412.01934 [cs]. [Online]. Available: http://arxiv.org/abs/...

work page arXiv 2024

[8] [8]

AI Value Alignment: Guiding Artificial Intelligence Towards Shared Human Goals,

World Economic Forum, “AI Value Alignment: Guiding Artificial Intelligence Towards Shared Human Goals,” World Economic Forum, Geneva, Switzerland, Tech. Rep. [Online]. Available: https://www.weforum.org/publications/ ai-value-alignment-guiding-artificial-intelligence-towards-shared-human-goals/

work page

[9] [9]

The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice,

F. Delgado, S. Yang, M. Madaio, and Q. Yang, “The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice,” inEquity and Access in Algorithms, Mechanisms, and Optimization. Boston MA USA: ACM, Oct. 2023, pp. 1–23. [Online]. Available: https://dl.acm.org/doi/10.1145/3617694.3623261

work page doi:10.1145/3617694.3623261 2023

[10] [10]

On the Design and Evaluation of Human-centered Explainable AI Systems: A Systematic Review and Taxonomy,

A. Mangold, J. Zietz, S. Weinhold, and S. Pannasch, “On the Design and Evaluation of Human-centered Explainable AI Systems: A Systematic Review and Taxonomy,” Oct. 2025, arXiv:2510.12201 [cs]. [Online]. Available: http://arxiv.org/abs/2510.12201

work page arXiv 2025

[11] [11]

Value Sensitive Design: The- ory and Methods,

B. Friedman, P. Kahn Jr., and A. Borning, “Value Sensitive Design: The- ory and Methods,” inProceedings of the 7th Conference on Designing Interactive Systems, 2002

work page 2002

[12] [12]

Measurement and Fairness,

A. Jacobs and H. Wallach, “Measurement and Fairness,”Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 375–385, Mar. 2021. [Online]. Available: http: //arxiv.org/abs/1912.05511

work page arXiv 2021

[13] [13]

Social Bias Frames: Reasoning about Social and Power Implications of Language,

M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, and Y. Choi, “Social Bias Frames: Reasoning about Social and Power Implications of Language,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 5477–5490. [Online]. Available: https://www.aclweb.org/a...

work page 2020

[14] [14]

This Thing Called Fairness: Disciplinary Confusion Realizing a Value in Technology,

D. K. Mulligan, J. A. Kroll, N. Kohli, and R. Y. Wong, “This Thing Called Fairness: Disciplinary Confusion Realizing a Value in Technology,” Proceedings of the ACM on Human-Computer Interaction, vol. 3, no. CSCW, pp. 1–36, Nov. 2019. [Online]. Available: https://dl.acm.org/doi/10.1145/3359221

work page doi:10.1145/3359221 2019

[15] [15]

Fairness and Machine Learning,

S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning. MIT Press, 2023

work page 2023

[16] [16]

arXiv:2505.18893 [cs.CY] https://arxiv.org/abs/2505.18893

R. Schwartz, R. Chowdhury, A. Kundu, H. Frase, M. Fadaee, T. David, G. Waters, A. Taik, M. Briggs, P. Hall, S. Jain, K. Yee, S. Thomas, S. Bhandari, P. Duncan, A. Thompson, M. Carlyle, Q. Lu, M. Holmes, and T. Skeadas, “Reality check: A new evaluation ecosystem is necessary to understand ai’s real world effects,” 2025. [Online]. Available: https://arxiv.o...

work page arXiv 2025

[17] [17]

Measurement validity: A shared standard for qualitative and quantitative research,

R. Adcock and D. Collier, “Measurement validity: A shared standard for qualitative and quantitative research,” American Political Science Review, vol. 95, no. 3, pp. 529–546, 2001

work page 2001

[18] [18]

How should ai safety benchmarks benchmark safety?

C. Yu, S. Engelmann, R. Cao, D. Ali, and O. Papakyriakopoulos, “How should ai safety benchmarks benchmark safety?” 2026. [Online]. Available: https://arxiv.org/abs/2601.23112

work page arXiv 2026

[19] [19]

Ai sustainability in practice part one: Foundations for sustainable ai projects,

D. Leslie, C. Rincón, M. Briggs, A. Perini, S. Jayadeva, A. Borda, S. Bennett, C. Burr, M. Aitken, M. Katell, C. Fischer, J. Wong, and I. Kherroubi Garcia, “Ai sustainability in practice part one: Foundations for sustainable ai projects,” 2023. [Online]. Available: https://zenodo.org/doi/10.5281/zenodo.10680113

work page doi:10.5281/zenodo.10680113 2023

[20] [20]

The tree of participation: a new model for inclusive decision-making,

K. Bell and M. Reed, “The tree of participation: a new model for inclusive decision-making,” Community Development Journal, vol. 57, no. 4, pp. 595–614, 06 2021. [Online]. Available: https://doi.org/10.1093/cdj/bsab018

work page doi:10.1093/cdj/bsab018 2021

[21] [21]

Responsible ai systems: Who are the stakeholders?

A. Deshpande and H. Sharp, “Responsible ai systems: Who are the stakeholders?” in Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, ser. AIES ’22. New York, NY, USA: Association for Computing Machinery, 2022, pp. 227–236. [Online]. Available: https://doi.org/10.1145/3514094.3534187

work page doi:10.1145/3514094.3534187 2022

[22] [22]

Theory-based evaluation: Past, present, and future,

C. H. Weiss, “Theory-based evaluation: Past, present, and future,” New Directions for Evaluation, no. 76, pp. 41–55, 1997

work page 1997

[23] [23]

Generative ai needs adaptive governance,

A. Reuel and T. A. Undheim, “Generative ai needs adaptive governance,” [Online]. Available: https://arxiv.org/abs/2406.04554

work page arXiv

[25] [25]

Troubling translation: Sociotechnical research in ai policy and governance,

S. Oduro, A. E. Marwick, C. Johnson, and E. Meyer, “Troubling translation: Sociotechnical research in ai policy and governance,” Internet Policy Review, vol. 14, no. 4. [Online]. Available: https://policyreview.info/articles/analysis/troubling-translation-ai-policy-and-governance

work page

[27] [27]

Frameworks, methods and shared tasks: Connecting participatory ai to trustworthy ai through a systematic review of global projects,

M. I. Magaña and K. Shilton, “Frameworks, methods and shared tasks: Connecting participatory ai to trustworthy ai through a systematic review of global projects,” in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, ser. FAccT ’25. New York, NY, USA: Association for Computing Machinery, 2025, pp. 2166–2179. [Online]. A...

work page doi:10.1145/3715275.3732148 2025

[28] [28]

Applying a theory of change approach to the evaluation of comprehensive community initiatives: Progress, prospects, and problems,

J. P. Connell and A. C. Kubisch, “Applying a theory of change approach to the evaluation of comprehensive community initiatives: Progress, prospects, and problems,” 1998. [Online]. Available: https://api.semanticscholar.org/CorpusID:2320879

work page 1998

[29] [29]

Advancements and innovations in requirements elicitation: Developing a comprehensive conceptual model,

O. A. Popoola, H. E. Adama, C. D. Okeke, and A. E. Akinoso, “Advancements and innovations in requirements elicitation: Developing a comprehensive conceptual model,” World Journal of Advanced Research and Reviews, vol. 22, no. 1, pp. 1209–1220, 2024. [Online]. Available: https://wjarr.com/sites/default/files/WJARR-2024-1202.pdf

work page 2024

[30] [30]

Exploring the use of llms for requirements specification in an it consulting company,

L. Pasquale, A. Ragone, E. Piemontese, and A. A. Darban, “Exploring the use of llms for requirements specification in an it consulting company,” 2025. [Online]. Available: https://arxiv.org/abs/2507.19113

work page arXiv 2025

[31] [31]

Large scale summarization using ensemble prompts and in context learning approaches,

A. Leiva-Araos, B. Gana, H. Allende-Cid, J. García, and M. J. Saikia, “Large scale summarization using ensemble prompts and in context learning approaches,” Scientific Reports, vol. 15, no. 1, p. 10259, Mar. 2025. [Online]. Available: https://www.nature.com/articles/s41598-025-94551-8

work page 2025

[32] [32]

An evaluation of large language models on text summarization tasks using prompt engineering techniques,

W. M. Aly, T. H. A. Soliman, and A. M. AbdelAziz, “An evaluation of large language models on text summarization tasks using prompt engineering techniques,” 2025. [Online]. Available: https://arxiv.org/abs/2507.05123

work page arXiv 2025

[33] [33]

Distinguishing task-specific and general-purpose ai in regulation,

J. Wang, A. Selbst, S. Barocas, and S. Venkatasubramanian, “Distinguishing task-specific and general-purpose ai in regulation,” [Online]. Available: https://arxiv.org/abs/2506.17347

work page arXiv

[35] [35]

Gender, race, and intersectional bias in resume screening via language model retrieval,

K. Wilson and A. Caliskan, “Gender, race, and intersectional bias in resume screening via language model retrieval,” ArXiv, vol. abs/2407.20371, 2024

work page arXiv 2024

[36] [36]

The forgotten contexts of evaluation,

B. Harris, L. Alderman, and J. Staheli, “The forgotten contexts of evaluation,” Evaluation, vol. 31, no. 2, pp. 240–261, 2025

work page 2025

[37] [37]

A guide to fundamental rights impact assessments (fria),

European Center for Not-for-Profit Law and Danish Institute for Human Rights, “A guide to fundamental rights impact assessments (fria),” The Hague, 2025, accessed 20 February 2026. [Online]. Available: https://ecnl.org/publications/guide-fundamental-rights-impact-assessments-fria

work page 2025

[38] [38]

A socio-technical-based process for questionnaire development in requirements elicitation via interviews,

A. Wahbeh, S. Sarnikar, and O. El-Gayar, “A socio-technical-based process for questionnaire development in requirements elicitation via interviews,” Requirements Engineering, vol. 25, 09 2020

work page 2020

[39] [39]

Improving governance outcomes through ai documentation: Bridging theory and practice,

A. A. Winecoff and M. Bogen, “Improving governance outcomes through ai documentation: Bridging theory and practice,” 2024. [Online]. Available: https://arxiv.org/abs/2409.08960

work page arXiv 2024

[40] [40]

ISO/IEC/IEEE 24765:2017, systems and software engineering — vocabulary,

International Organization for Standardization (ISO), International Electrotechnical Commission (IEC), and Institute of Electrical and Electronics Engineers (IEEE), “ISO/IEC/IEEE 24765:2017, systems and software engineering — vocabulary,” Geneva, Switzerland, 2017. [Online]. Available: https://www.iso.org/standard/71952.html

work page 2017