Recognition: 1 Lean theorem link
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
Pith reviewed 2026-05-11 02:51 UTC · model grok-4.3
The pith
Defining key components for AI evaluation scenarios supports consistent and meaningful human-centered comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that defining key scenario components through a repeatable three-stage process, one that combines LLM prompting with iterative human review, gives AI evaluations greater methodological transparency, operational grounding, and adherence to human-centered design principles. The authors demonstrate this by generating scenarios from financial services use cases such as cyber defense enablement and developer productivity.
What carries the argument
The AI Use Case Worksheet, with its six key elements, and the three-stage expansion pipeline, which integrates LLM prompting and human review at every step to keep scenarios grounded in real-world usage.
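As a concrete illustration of the machinery described above, the worksheet and the three-stage expansion pipeline could be sketched as follows. This is a hypothetical reading of the paper's structure, not the authors' actual schema or prompts; field names merely paraphrase the six elements.

```python
from dataclasses import dataclass

# Hypothetical rendering of the paper's six-element AI Use Case Worksheet.
@dataclass
class UseCaseWorksheet:
    use_case: str
    sector: str
    users: dict             # {"direct": [...], "indirect": [...]}
    intended_outcomes: list
    expected_impacts: dict  # {"positive": [...], "negative": [...]}
    kpis_and_metrics: list

def expand_to_scenarios(worksheet, llm, human_review):
    """Three-stage expansion: each LLM stage is gated by a human checkpoint,
    mirroring the paper's 'iterative human reviews at every juncture'."""
    # Stage 1: scenario titles and descriptions
    titles = human_review(llm(f"Draft scenario titles and descriptions for: {worksheet.use_case}"))
    # Stage 2: core elements (users, benefits and risks, metrics)
    elements = human_review(llm(f"Fill in users, benefits/risks, and metrics for: {titles}"))
    # Stage 3: narratives and evaluation objectives
    return human_review(llm(f"Write narratives and evaluation objectives for: {elements}"))
```

With a real LLM client substituted for `llm` and SME reviewers behind `human_review`, iterating this over each elicited worksheet would yield the kind of scenario set the paper reports (107 scenarios from six use cases).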
If this is right
- AI evaluations can move from apples-to-oranges to apples-to-apples comparisons by using standardized scenario components.
- Scenarios will include considerations of positive and negative impacts on direct and indirect users.
- The process supports the inclusion of specific KPIs and metrics tailored to each use case.
- Human reviews ensure that generated scenarios remain reflective of actual needs rather than abstract assumptions.
Where Pith is reading between the lines
- Applying this method across more sectors could identify common patterns in AI use cases that generalize beyond finance.
- Future work might test whether scenarios created this way lead to more reproducible evaluation outcomes when different teams assess the same AI system.
- Integrating this into standard practices could shift AI development towards designs that explicitly account for operational risks and benefits.
Load-bearing premise
That the iterative human reviews will consistently eliminate bias and ensure the scenarios accurately reflect real-world usage without the reviewers imposing their own views.
What would settle it
Have independent teams apply the process to the same use cases: if the resulting scenarios yield conflicting rankings or assessments of the same AI systems, the claim of achieving consistent comparisons is falsified.
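Such a cross-team test could be scored with a rank-agreement statistic. A minimal Kendall-tau sketch (with made-up system names; assumes strict rankings with no ties):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Agreement between two orderings of the same items:
    +1 for identical rankings, -1 for fully reversed ones."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        a = rank_a.index(x) - rank_a.index(y)
        b = rank_b.index(x) - rank_b.index(y)
        if a * b > 0:
            concordant += 1  # pair ordered the same way by both teams
        else:
            discordant += 1  # pair ordered oppositely
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

team_1 = ["system_A", "system_B", "system_C", "system_D"]
team_2 = ["system_A", "system_C", "system_B", "system_D"]
print(kendall_tau(team_1, team_2))  # one discordant pair of six -> 2/3
```

Values near +1 across independent teams would support the consistency claim; values near zero or negative would undercut it.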
Original abstract
AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases provided are illustrative of the process and not exhaustive. Central to our work is a three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios from those use cases elicited from SMEs. This process integrates iterative human reviews at every juncture to ensure operational grounding: for scenario titles and descriptions; for core scenario elements like users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a repeatable process to transform high-level AI use cases into detailed evaluation scenarios, using a six-element AI Use Case Worksheet (use case, sector, user, intended outcomes, expected impacts, KPIs/metrics) elicited from subject matter experts, followed by a three-stage LLM-prompting pipeline with iterative human reviews at each checkpoint. It illustrates the approach in the U.S. financial services sector by eliciting six use cases (cyber defense, developer productivity, financial crime aggregation, SAR filing, credit memo generation, internal call center support) and generating 107 scenarios, while describing a validation rubric for scenario quality. The central claim is that this structured, human-centered method promotes methodological transparency, operational grounding, and more consistent 'apples-to-apples' AI evaluations.
Significance. If the process can be empirically shown to yield scenarios that demonstrably improve evaluation consistency and relevance, it would offer a practical contribution to AI measurement science by bridging abstract use cases to evaluable, human-need-aligned scenarios. The worksheet's explicit elements and the emphasis on human checkpoints at multiple stages are clear strengths for operational relevance, but the current absence of any quantitative validation results substantially limits the demonstrated significance.
major comments (2)
- [Abstract and pipeline/results sections] Abstract and the section describing the three-stage pipeline and 107 scenarios: the manuscript states that the process 'supports a more consistent and meaningful paradigm' and that 'human checkpoints ensure scenarios remain reflective of real-world usage,' yet reports no application of the described validation rubric, no rubric scores, no inter-rater reliability statistics for the human reviews, and no comparison of the generated scenarios against existing methods or benchmarks. This leaves the central claim without direct empirical support.
- [Human review checkpoints] Section on human review checkpoints: the repeated assertion that iterative human reviews reliably produce operationally grounded, bias-free scenarios is presented as a core strength, but no data on reviewer agreement, bias assessment, or grounding verification are supplied, rendering the human-centered design premise untested rather than demonstrated.
minor comments (1)
- [Abstract] The abstract notes that the six use cases 'are illustrative of the process and not exhaustive'; adding a brief discussion of selection criteria or coverage limitations would improve transparency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the operational relevance of the worksheet and human checkpoints. The manuscript presents a methodological framework illustrated with examples rather than a completed empirical validation study. We address each major comment below with clarifications on scope and planned revisions.
Point-by-point responses
-
Referee: [Abstract and pipeline/results sections] Abstract and the section describing the three-stage pipeline and 107 scenarios: the manuscript states that the process 'supports a more consistent and meaningful paradigm' and that 'human checkpoints ensure scenarios remain reflective of real-world usage,' yet reports no application of the described validation rubric, no rubric scores, no inter-rater reliability statistics for the human reviews, and no comparison of the generated scenarios against existing methods or benchmarks. This leaves the central claim without direct empirical support.
Authors: We agree that the manuscript does not report quantitative application of the validation rubric, rubric scores, inter-rater reliability statistics, or direct comparisons to existing methods. This work is a descriptive proposal of the repeatable process, the six-element worksheet, the three-stage LLM-plus-human pipeline, and the rubric itself, illustrated by generating 107 scenarios from six SME-elicited use cases. The claims about supporting consistency are based on the explicit structure and human checkpoints rather than post-hoc empirical results from this study. We will revise the abstract, pipeline description, and results sections to state more precisely that the framework is designed to enable such consistency and that the rubric is offered for future use. A new limitations subsection will be added to note the absence of these quantitative measures in the current report and to recommend their collection in subsequent applications. No fabricated data will be added. revision: partial
-
Referee: [Human review checkpoints] Section on human review checkpoints: the repeated assertion that iterative human reviews reliably produce operationally grounded, bias-free scenarios is presented as a core strength, but no data on reviewer agreement, bias assessment, or grounding verification are supplied, rendering the human-centered design premise untested rather than demonstrated.
Authors: The manuscript describes the human review checkpoints as integral to the process to maintain operational grounding and alignment with real-world needs, but we acknowledge that no quantitative data on reviewer agreement, bias assessment, or verification metrics are provided. The human reviews were performed by the authors and domain experts during scenario generation, yet formal reliability statistics were not collected for this initial demonstration. We will revise the human review checkpoints section to provide additional detail on the review stages, the qualifications of the reviewers involved, and the specific criteria applied at each checkpoint. Assertions about reliability will be softened to reflect that the design incorporates these safeguards, while explicitly stating that empirical testing of inter-rater agreement and bias reduction is recommended for future work using the method. revision: partial
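The inter-rater agreement statistic the referee asks for is straightforward to compute once two reviewers score the same scenarios against the rubric. A minimal Cohen's-kappa sketch, with made-up pass/fail labels for illustration:

```python
from collections import Counter

def cohens_kappa(rater_1, rater_2):
    """Chance-corrected agreement between two raters labeling the same items."""
    n = len(rater_1)
    # Observed fraction of items where the raters agree
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Agreement expected by chance from each rater's label frequencies
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical rubric verdicts on ten generated scenarios
reviewer_1 = ["pass"] * 6 + ["fail"] * 4
reviewer_2 = ["pass"] * 5 + ["fail"] * 5
print(cohens_kappa(reviewer_1, reviewer_2))  # -> 0.8
```

Reporting a statistic like this for the human checkpoints would turn the "human reviews ensure grounding" premise into a testable, quantified claim.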
Circularity Check
No significant circularity; procedural framework is self-contained
Full rationale
The manuscript outlines a methodological process involving an AI Use Case Worksheet with six elements and a three-stage pipeline of LLM prompting combined with iterative human reviews to expand use cases into detailed scenarios. This is demonstrated through examples in the financial services sector, generating 107 scenarios. There are no equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The derivation chain consists of descriptive steps for transforming high-level use cases to scenarios, which does not exhibit self-definitional or renaming patterns. The proposal remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption · Iterative human reviews at every stage ensure scenarios remain reflective of real-world usage and human needs
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · AI Use Case Worksheet with six key elements: use case, sector, user, intended outcomes, expected impacts, KPIs and metrics; three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios
Reference graph
Works this paper leans on
-
[1]
What Is the US Economy’s Potential Growth Rate?, 2025
Manuel Abecasis. What Is the US Economy’s Potential Growth Rate?, 2025. Retrieved December 16, 2025 from https://www.goldmansachs.com/insights/articles/what-is-the-us-economys-potential-growth-rate
work page 2025
-
[2]
Amazon Web Services. Explore AI Use Cases. Retrieved December 15, 2025 from https://aws.amazon.com/ai/generative-ai/use-cases/, 2025
work page 2025
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, et al. Constitutional AI: Harmlessness From AI Feedback. arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Foundation Model Transparency Reports
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, and Percy Liang. Foundation Model Transparency Reports. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 181–195, 2024
work page 2024
-
[5]
Integrated Innovation Strategy 2025
Cabinet Office, Government of Japan. Integrated Innovation Strategy 2025. Retrieved December 16, 2025 from https://www8.cao.go.jp/cstp/tougosenryaku/togo2025_honbun_eiyaku.pdf, 2025
work page 2025
-
[6]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024
work page 2024
-
[7]
Michael Chui, James Manyika, Mehdi Miremadi, Nicolaus Henke, Rita Chung, Pieter Nel, and Sankalp Malhotra. Notes From the AI Frontier: Insights From Hundreds of Use Cases. McKinsey Global Institute, 2(267):1–31, 2018. Retrieved December 15, 2025 from https://www.mckinsey.com/~/media/mckinsey/featured%20insights/artificial%20intelligence/notes%20from%20...
work page 2018
-
[8]
-
[8]
The Effects of Generative AI on High-Skilled Work: Evidence From Three Field Experiments With Software Developers
Kevin Zheyuan Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. The Effects of Generative AI on High-Skilled Work: Evidence From Three Field Experiments With Software Developers. SSRN. Available at SSRN: http://dx.doi.org/10.2139/ssrn.4945566
-
[10]
Financial Services AI Risk Management Framework (FS AI RMF)
Cyber Risk Institute. Financial Services AI Risk Management Framework (FS AI RMF). Retrieved April 20, 2026 from https://cyberriskinstitute.org/artificial-intelligence-risk-management/, 2026
work page 2026
-
[11]
Deloitte AI Institute. The AI Dossier: 80+ AI Use Cases: A Collection of New, High-Impact AI Use Cases Organized by Industry, Enterprise Function, and AI Type, 2025. Retrieved December 15, 2025 from https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/ai-use-cases.html
work page 2025
-
[12]
Next-Gen Controllership: AI and Emerging Tech’s Impact on Finance
Deloitte Center for Controllership. Next-Gen Controllership: AI and Emerging Tech’s Impact on Finance. Retrieved December 16, 2025 from https://www.deloitte.com/content/dam/assets-zone3/us/en/docs/services/consulting/2025/agentic-ai-dbriefs-poll-results-deck.pdf
work page 2025
-
[14]
Gen AI’s Productivity Promise: Huge Potential but Most Have Not Yet Reached Scaled Impact
Marie El Hoyek, Nicolai Müller, and Jonas Ronellenfitsch. Gen AI’s Productivity Promise: Huge Potential but Most Have Not Yet Reached Scaled Impact. McKinsey & Company. Retrieved December 16, 2025 from https://www.mckinsey.com/capabilities/operations/our-insights/operations-blog/gen-ais-productivity-promise-huge-potential-but-most-have-not-yet-reached-sc...
work page 2025
-
[15]
Commission Launches Two Strategies to Speed Up AI Uptake in European Industry and Science
European Union. Commission Launches Two Strategies to Speed Up AI Uptake in European Industry and Science. Retrieved December 16, 2025 from https://ec.europa.eu/commission/presscorner/detail/en/ip_25_2299, 2025
work page 2025
-
[16]
Empowering Communities: The Impact of Financial Institutions on Economic Growth
Hannah Fischer-Lauder. Empowering Communities: The Impact of Financial Institutions on Economic Growth. Retrieved January 6, 2025 from https://impakter.com/empowering-communities-the-impact-of-financial-institutions-on-economic-growth/, 2025
work page 2025
-
[17]
Lindsey Gailmard, Drew Spence, Christie Lawrence, and Daniel E. Ho. Known Unknowns and Unknown Unknowns: Designing a Scalable Adverse Event Reporting System for AI. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1004–1017, 2025
work page 2025
-
[18]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv preprint arXiv:2209.07858, 2022
work page internal anchor Pith review arXiv 2022
-
[19]
1,001 Real-World Gen AI Use Cases From the World’s Leading Organizations
Google. 1,001 Real-World Gen AI Use Cases From the World’s Leading Organizations. Retrieved December 15, 2025 from https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders, 2025
work page 2025
-
[20]
AI Strategy for the Federal Public Service 2025-2027
Government of Canada. AI Strategy for the Federal Public Service 2025-2027. Retrieved December 16, 2025 from https://www.canada.ca/en/government/system/digital-government/digital-government-innovations/responsible-use-ai/gc-ai-strategy-priority-areas.html, 2025
work page 2025
-
[21]
Government of India. Transforming India With AI. Retrieved December 16, 2025 from https://www.pib.gov.in/PressReleasePage.aspx?PRID=2178092&reg=3&lang=2, 2025
work page 2025
-
[22]
ISO. ISO 9241-11:2018 Ergonomics of Human-System Interaction — Part 11: Usability: Definitions and Concepts, 2018
work page 2018
-
[23]
ISO. ISO 9241-210:2019 Ergonomics of Human-System Interaction — Part 210: Human-Centred Design for Interactive Systems, 2019
work page 2019
-
[24]
ISO/IEC 42001:2023 Information Technology — Artificial Intelligence — Management System, 2023
ISO/IEC. ISO/IEC 42001:2023 Information Technology — Artificial Intelligence — Management System, 2023
work page 2023
-
[25]
ISO/IEC TR 24030:2024 Information Technology – Artificial Intelligence (AI) – Use Cases, 2024
ISO/IEC. ISO/IEC TR 24030:2024 Information Technology – Artificial Intelligence (AI) – Use Cases, 2024
work page 2024
-
[26]
Human + AI: Redefining the Standard of Care in Medicine
Johns Hopkins, Malone Center for Engineering in Healthcare. Human + AI: Redefining the Standard of Care in Medicine. The 9th Annual Johns Hopkins Research Symposium on Engineering in Healthcare. https://malonecenter.jhu.edu/johns-hopkins-malone-center-2025-symposium/, 2025
work page 2025
-
[27]
Giulia Karanxha and Paulinus Ofem. Evaluating Transparency in the Development of Artificial Intelligence Systems: A Systematic Literature Review. International Journal of Advanced Computer Science & Applications, 16(10), 2025
work page 2025
-
[28]
KPMG Global AI in Finance Report, 2024
KPMG. KPMG Global AI in Finance Report, 2024. Retrieved December 16, 2025 from https://assets.kpmg.com/content/dam/kpmgsites/xx/pdf/2024/11/ai-in-finance.pdf.coredownload.inline.pdf
work page 2024
-
[29]
Hao-Ping Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–22, 2025
work page 2025
-
[30]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, et al. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
The AI Index 2025 Annual Report
Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, et al. The AI Index 2025 Annual Report. arXiv preprint arXiv:2504.07139, 2025. AI Index Steering Committee, Institute for Human-Centered AI, Stanford University
-
[32]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249, 2024
work page internal anchor Pith review arXiv 2024
-
[33]
Anna Milley. Financial Markets: The Backbone of the Global Economy. Journal of Stock & Forex Trading, 12:292, 2025
work page 2025
-
[34]
How Artificial Intelligence Impacts the US Labor Market
Seb Murray. How Artificial Intelligence Impacts the US Labor Market. MIT Sloan School of Management. Retrieved December 16, 2025 from https://mitsloan.mit.edu/ideas-made-to-matter/how-artificial-intelligence-impacts-us-labor-market, 2025
work page 2025
-
[35]
A Review of Evaluation Metrics in Machine Learning Algorithms
Gireen Naidu, Tranos Zuva, and Elias Mmbongeni Sibanda. A Review of Evaluation Metrics in Machine Learning Algorithms. In Computer Science On-line Conference, pages 15–25. Springer International Publishing, 2023
work page 2023
-
[36]
Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023
NIST AI 100-1. Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023
work page 2023
-
[37]
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, 2024
NIST AI 600-1. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, 2024
work page 2024
-
[38]
Assessing Risks and Impacts of AI (ARIA): Pilot Evaluation Report, 2025
NIST AI 700-2. Assessing Risks and Impacts of AI (ARIA): Pilot Evaluation Report, 2025
work page 2025
-
[39]
OECD. OECD AI Principles. Retrieved December 16, 2025 from https://oecd.ai/en/ai-principles, 2025
work page 2025
-
[40]
Catalogue of Tools & Metrics for Trustworthy AI, List of Metric Use Cases
OECD.AI. Catalogue of Tools & Metrics for Trustworthy AI, List of Metric Use Cases. Retrieved December 16, 2025 from https://oecd.ai/en/catalogue/metric-use-cases, 2025
work page 2025
-
[41]
Digital Transformations & Tech Adoption by Sector (2025)
Levi Olmstead. Digital Transformations & Tech Adoption by Sector (2025). Retrieved January 06, 2026 from https://whatfix.com/blog/digital-transformation-by-sector/, 2025
work page 2025
-
[42]
David Powers. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011
work page 2011
-
[43]
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1776–1826, 2022
work page 2022
-
[44]
A Sociotechnical Audit: Assessing Police Use of Facial Recognition
Evani Radiya-Dixit and Gina Neff. A Sociotechnical Audit: Assessing Police Use of Facial Recognition. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1334–1346, 2023
work page 2023
-
[45]
Kevin Roose. A.I. Has a Measurement Problem. The New York Times, 2024. Retrieved January 6, 2026 from https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html
work page 2026
-
[46]
Disclosure Without Engagement: An Empirical Review of Positionality Statements at FAccT
Hope Schroeder, Akshansh Pareek, and Solon Barocas. Disclosure Without Engagement: An Empirical Review of Positionality Statements at FAccT. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 1195–1210, 2025
work page 2025
-
[47]
Andrew D. Selbst, danah boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the 2019 ACM Conference on Fairness, Accountability, and Transparency, pages 59–68, 2019
work page 2019
-
[48]
Generative AI in the Wild: Prospects, Challenges, and Strategies
Yuan Sun, Eunchae Jang, Fenglong Ma, and Ting Wang. Generative AI in the Wild: Prospects, Challenges, and Strategies. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2024
work page 2024
-
[49]
U.S. Census. Business Trends and Outlook Survey (BTOS) Key Performance Indicators. Retrieved December 16, 2025 from https://www.census.gov/hfp/btos/data, 2025
work page 2025
-
[50]
U.S. Chamber of Commerce. Empowering Small Business: The Impact of Technology on U.S. Small Business, 4th Ed., 2025. Retrieved December 16, 2025 from https://www.uschamber.com/assets/documents/20251621-CTEC-Empowering-Small-Business-Report-2025-v1-r10-Digital-FINAL.pdf
work page 2025
-
[51]
Chief Information Officers Council
U.S. Chief Information Officers Council. 2024 Federal AI Use Case Inventory. Retrieved December 15, 2025 from https://www.cio.gov/policies-and-priorities/Executive-Order-13960-AI-Use-Case-Inventories-Reference/, 2024
work page 2024
-
[52]
Government Accountability Office
U.S. Government Accountability Office. Generative AI Use and Management at Federal Agencies, 2025. Accessed April 27, 2026 from https://www.gao.gov/products/gao-25-107653
work page 2025
-
[53]
The White House. Winning the Race: America’s AI Action Plan, 2025. Retrieved December 16, 2025 from https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf
work page 2025
-
[54]
U.S. Treasury. Artificial Intelligence in Financial Services: Report on the Uses, Opportunities, and Risks of Artificial Intelligence in the Financial Services Sector, 2024. Retrieved December 16, 2025 from https://home.treasury.gov/system/files/136/Artificial-Intelligence-in-Financial-Services.pdf
work page 2024
-
[55]
Peter M. VanNostrand, Dennis M. Hofmann, Lei Ma, and Elke A. Rundensteiner. Actionable Recourse for Automated Decisions: Examining the Effects of Counterfactual Explanation Type and Presentation on Lay User Understanding. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1682–1700, 2024
work page 2024
-
[56]
Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing Judges With Juries: Evaluating LLM Generations With a Panel of Diverse Models. arXiv preprint arXiv:2404.18796, 2024
-
[57]
Guy H. Walker, Neville A. Stanton, Paul M. Salmon, and Daniel P. Jenkins. A Review of Sociotechnical Systems Theory: A Classic Concept for New Command and Control Paradigms. Theoretical Issues in Ergonomics Science, 9(6):479–499, 2008
work page 2008
-
[58]
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, et al. Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. arXiv preprint arXiv:2502.00561, 2025
-
[59]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems, volume 36, pages 80079–80110, 2023
work page 2023
-
[60]
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, et al. MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
work page 2024
-
[61]
AIR-Bench 2024: A Safety Benchmark Based on Regulation and Policies Specified Risk Categories
Yi Zeng, Yu Yang, Andy Zou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. AIR-Bench 2024: A Safety Benchmark Based on Regulation and Policies Specified Risk Categories. In The Thirteenth International Conference on Learning Representations, 2025
work page 2024
-
[62]
Judging LLM-as-a-Judge With MT-bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, et al. Judging LLM-as-a-Judge With MT-bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623, 2023
work page 2023
-
[63]
Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, et al. Security Challenges in AI Agent Deployment: Insights From a Large Scale Public Competition. arXiv preprint arXiv:2507.20526, 2025
discussion (0)