Recognition: 1 Lean theorem link
Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
Pith reviewed 2026-05-11 02:51 UTC · model grok-4.3
The pith
Defining key components for AI evaluation scenarios supports consistent and meaningful human-centered comparisons.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that defining key scenario components through a repeatable three-stage process, one that combines LLM prompting with iterative human review, gives AI evaluations greater methodological transparency, operational grounding, and adherence to human-centered design principles. The authors demonstrate this by generating scenarios from financial services use cases such as cyber defense enablement and developer productivity.
What carries the argument
The AI Use Case Worksheet, with its six key elements, and the three-stage expansion pipeline, which integrates LLM prompting and human review at every step to keep scenarios grounded in real-world usage.
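As a concrete illustration of the machinery described above, the worksheet and the three-stage expansion pipeline could be sketched as follows. This is a hypothetical reading of the paper's structure, not the authors' actual schema or prompts; field names merely paraphrase the six elements.

```python
from dataclasses import dataclass

# Hypothetical rendering of the paper's six-element AI Use Case Worksheet.
@dataclass
class UseCaseWorksheet:
    use_case: str
    sector: str
    users: dict             # {"direct": [...], "indirect": [...]}
    intended_outcomes: list
    expected_impacts: dict  # {"positive": [...], "negative": [...]}
    kpis_and_metrics: list

def expand_to_scenarios(worksheet, llm, human_review):
    """Three-stage expansion: each LLM stage is gated by a human checkpoint,
    mirroring the paper's 'iterative human reviews at every juncture'."""
    # Stage 1: scenario titles and descriptions
    titles = human_review(llm(f"Draft scenario titles and descriptions for: {worksheet.use_case}"))
    # Stage 2: core elements (users, benefits and risks, metrics)
    elements = human_review(llm(f"Fill in users, benefits/risks, and metrics for: {titles}"))
    # Stage 3: narratives and evaluation objectives
    return human_review(llm(f"Write narratives and evaluation objectives for: {elements}"))
```

With a real LLM client substituted for `llm` and SME reviewers behind `human_review`, iterating this over each elicited worksheet would yield the kind of scenario set the paper reports (107 scenarios from six use cases).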
If this is right
- AI evaluations can move from apples-to-oranges to apples-to-apples comparisons by using standardized scenario components.
- Scenarios will include considerations of positive and negative impacts on direct and indirect users.
- The process supports the inclusion of specific KPIs and metrics tailored to each use case.
- Human reviews ensure that generated scenarios remain reflective of actual needs rather than abstract assumptions.
Where Pith is reading between the lines
- Applying this method across more sectors could identify common patterns in AI use cases that generalize beyond finance.
- Future work might test whether scenarios created this way lead to more reproducible evaluation outcomes when different teams assess the same AI system.
- Integrating this into standard practices could shift AI development towards designs that explicitly account for operational risks and benefits.
Load-bearing premise
That the iterative human reviews will consistently eliminate bias and ensure the scenarios accurately reflect real-world usage without the reviewers imposing their own views.
What would settle it
Have independent teams apply the process to the same use cases: if the resulting scenarios yield conflicting rankings or assessments of the same AI systems, the claim of achieving consistent comparisons is falsified.
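Such a cross-team test could be scored with a rank-agreement statistic. A minimal Kendall-tau sketch (with made-up system names; assumes strict rankings with no ties):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Agreement between two orderings of the same items:
    +1 for identical rankings, -1 for fully reversed ones."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        a = rank_a.index(x) - rank_a.index(y)
        b = rank_b.index(x) - rank_b.index(y)
        if a * b > 0:
            concordant += 1  # pair ordered the same way by both teams
        else:
            discordant += 1  # pair ordered oppositely
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

team_1 = ["system_A", "system_B", "system_C", "system_D"]
team_2 = ["system_A", "system_C", "system_B", "system_D"]
print(kendall_tau(team_1, team_2))  # one discordant pair of six -> 2/3
```

Values near +1 across independent teams would support the consistency claim; values near zero or negative would undercut it.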
Original abstract
AI measurement science has a wide variety of methodologies and measurements for comparing AI systems, resulting in what often appear to be "apples-to-oranges" comparisons across AI evaluations. To move toward "apples-to-apples" comparisons in real-world AI evaluations, this work advocates for methodological transparency in evaluation scenarios, operational grounding, and human-centered design (HCD) principles. We propose a repeatable process for transforming high-level use cases to detailed scenarios by eliciting use cases from subject matter experts (SMEs) via a structured AI Use Case Worksheet with six key elements: use case, sector, user (direct and indirect), intended outcomes, expected impacts (positive and negative), and KPIs and metrics. We demonstrate utility of the worksheet and process in the U.S. financial services sector. This paper reports on example high-level AI use cases identified by financial services sector SMEs: cyber defense enablement, developer productivity, financial crime aggregation, suspicious activity report (SAR) filing, credit memo generation, and internal call center support. These AI use cases provided are illustrative of the process and not exhaustive. Central to our work is a three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios from those use cases elicited from SMEs. This process integrates iterative human reviews at every juncture to ensure operational grounding: for scenario titles and descriptions; for core scenario elements like users, benefits and risks, and metrics; and for scenario narratives and evaluation objectives. Human checkpoints ensure scenarios remain reflective of real-world usage and human needs. We describe a validation rubric to assess scenario quality. By defining key scenario components, this work supports a more consistent and meaningful paradigm for human-centered AI evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a repeatable process to transform high-level AI use cases into detailed evaluation scenarios, using a six-element AI Use Case Worksheet (use case, sector, user, intended outcomes, expected impacts, KPIs/metrics) elicited from subject matter experts, followed by a three-stage LLM-prompting pipeline with iterative human reviews at each checkpoint. It illustrates the approach in the U.S. financial services sector by eliciting six use cases (cyber defense, developer productivity, financial crime aggregation, SAR filing, credit memo generation, internal call center support) and generating 107 scenarios, while describing a validation rubric for scenario quality. The central claim is that this structured, human-centered method promotes methodological transparency, operational grounding, and more consistent 'apples-to-apples' AI evaluations.
Significance. If the process can be empirically shown to yield scenarios that demonstrably improve evaluation consistency and relevance, it would offer a practical contribution to AI measurement science by bridging abstract use cases to evaluable, human-need-aligned scenarios. The worksheet's explicit elements and the emphasis on human checkpoints at multiple stages are clear strengths for operational relevance, but the current absence of any quantitative validation results substantially limits the demonstrated significance.
major comments (2)
- [Abstract and pipeline/results sections] Abstract and the section describing the three-stage pipeline and 107 scenarios: the manuscript states that the process 'supports a more consistent and meaningful paradigm' and that 'human checkpoints ensure scenarios remain reflective of real-world usage,' yet reports no application of the described validation rubric, no rubric scores, no inter-rater reliability statistics for the human reviews, and no comparison of the generated scenarios against existing methods or benchmarks. This leaves the central claim without direct empirical support.
- [Human review checkpoints] Section on human review checkpoints: the repeated assertion that iterative human reviews reliably produce operationally grounded, bias-free scenarios is presented as a core strength, but no data on reviewer agreement, bias assessment, or grounding verification are supplied, rendering the human-centered design premise untested rather than demonstrated.
minor comments (1)
- [Abstract] The abstract notes that the six use cases 'are illustrative of the process and not exhaustive'; adding a brief discussion of selection criteria or coverage limitations would improve transparency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the operational relevance of the worksheet and human checkpoints. The manuscript presents a methodological framework illustrated with examples rather than a completed empirical validation study. We address each major comment below with clarifications on scope and planned revisions.
Point-by-point responses
-
Referee: [Abstract and pipeline/results sections] Abstract and the section describing the three-stage pipeline and 107 scenarios: the manuscript states that the process 'supports a more consistent and meaningful paradigm' and that 'human checkpoints ensure scenarios remain reflective of real-world usage,' yet reports no application of the described validation rubric, no rubric scores, no inter-rater reliability statistics for the human reviews, and no comparison of the generated scenarios against existing methods or benchmarks. This leaves the central claim without direct empirical support.
Authors: We agree that the manuscript does not report quantitative application of the validation rubric, rubric scores, inter-rater reliability statistics, or direct comparisons to existing methods. This work is a descriptive proposal of the repeatable process, the six-element worksheet, the three-stage LLM-plus-human pipeline, and the rubric itself, illustrated by generating 107 scenarios from six SME-elicited use cases. The claims about supporting consistency are based on the explicit structure and human checkpoints rather than post-hoc empirical results from this study. We will revise the abstract, pipeline description, and results sections to state more precisely that the framework is designed to enable such consistency and that the rubric is offered for future use. A new limitations subsection will be added to note the absence of these quantitative measures in the current report and to recommend their collection in subsequent applications. No fabricated data will be added. revision: partial
-
Referee: [Human review checkpoints] Section on human review checkpoints: the repeated assertion that iterative human reviews reliably produce operationally grounded, bias-free scenarios is presented as a core strength, but no data on reviewer agreement, bias assessment, or grounding verification are supplied, rendering the human-centered design premise untested rather than demonstrated.
Authors: The manuscript describes the human review checkpoints as integral to the process to maintain operational grounding and alignment with real-world needs, but we acknowledge that no quantitative data on reviewer agreement, bias assessment, or verification metrics are provided. The human reviews were performed by the authors and domain experts during scenario generation, yet formal reliability statistics were not collected for this initial demonstration. We will revise the human review checkpoints section to provide additional detail on the review stages, the qualifications of the reviewers involved, and the specific criteria applied at each checkpoint. Assertions about reliability will be softened to reflect that the design incorporates these safeguards, while explicitly stating that empirical testing of inter-rater agreement and bias reduction is recommended for future work using the method. revision: partial
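The inter-rater agreement statistic the referee asks for is straightforward to compute once two reviewers score the same scenarios against the rubric. A minimal Cohen's-kappa sketch, with made-up pass/fail labels for illustration:

```python
from collections import Counter

def cohens_kappa(rater_1, rater_2):
    """Chance-corrected agreement between two raters labeling the same items."""
    n = len(rater_1)
    # Observed fraction of items where the raters agree
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Agreement expected by chance from each rater's label frequencies
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical rubric verdicts on ten generated scenarios
reviewer_1 = ["pass"] * 6 + ["fail"] * 4
reviewer_2 = ["pass"] * 5 + ["fail"] * 5
print(cohens_kappa(reviewer_1, reviewer_2))  # -> 0.8
```

Reporting a statistic like this for the human checkpoints would turn the "human reviews ensure grounding" premise into a testable, quantified claim.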
Circularity Check
No significant circularity; procedural framework is self-contained
Full rationale
The manuscript outlines a methodological process involving an AI Use Case Worksheet with six elements and a three-stage pipeline of LLM prompting combined with iterative human reviews to expand use cases into detailed scenarios. This is demonstrated through examples in the financial services sector, generating 107 scenarios. There are no equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The derivation chain consists of descriptive steps for transforming high-level use cases to scenarios, which does not exhibit self-definitional or renaming patterns. The proposal remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption · Iterative human reviews at every stage ensure scenarios remain reflective of real-world usage and human needs
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · AI Use Case Worksheet with six key elements: use case, sector, user, intended outcomes, expected impacts, KPIs and metrics; three-stage expansion pipeline combining LLM prompting with human reviews to generate 107 scenarios
Reference graph
Works this paper leans on
-
[1]
What Is the US Economy’s Potential Growth Rate?, 2025
Manuel Abecasis. What Is the US Economy’s Potential Growth Rate?, 2025. Retrieved December 16, 2025 from https://www.goldmansachs.com/insights/articles/what-is-the-us-economys-potential-growth-rate
work page 2025
-
[2]
Amazon Web Services. Explore AI Use Cases. Retrieved December 15, 2025 from https://aws.amazon.com/ai/generative-ai/use-cases/, 2025
work page 2025
-
[3]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, et al. Constitutional AI: Harmlessness From AI Feedback. arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Foundation Model Transparency Reports
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, and Percy Liang. Foundation Model Transparency Reports. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 181–195, 2024
work page 2024
-
[5]
Integrated Innovation Strategy 2025
Cabinet Office, Government of Japan. Integrated Innovation Strategy 2025. Retrieved December 16, 2025 from https://www8.cao.go.jp/cstp/tougosenryaku/togo2025_honbun_eiyaku.pdf, 2025
work page 2025
-
[6]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A Survey on Evaluation of Large Language Models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024
work page 2024
-
[7]
Michael Chui, James Manyika, Mehdi Miremadi, Nicolaus Henke, Rita Chung, Pieter Nel, and Sankalp Malhotra. Notes From the AI Frontier: Insights From Hundreds of Use Cases. McKinsey Global Institute, 2(267):1–31, 2018. Retrieved December 15, 2025 from https://www.mckinsey.com/~/media/mckinsey/featured%20insights/artificial%20intelligence/notes%20from%20...
work page 2018
-
[8]
-
[8]
The Effects of Generative AI on High-Skilled Work: Evidence From Three Field Experiments With Software Developers
Kevin Zheyuan Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. The Effects of Generative AI on High-Skilled Work: Evidence From Three Field Experiments With Software Developers. SSRN. Available at SSRN: http://dx.doi.org/10.2139/ssrn.4945566
-
[10]
Financial Services AI Risk Management Framework (FS AI RMF)
Cyber Risk Institute. Financial Services AI Risk Management Framework (FS AI RMF). Retrieved April 20, 2026 from https://cyberriskinstitute.org/artificial-intelligence-risk-management/, 2026
work page 2026
-
[11]
Deloitte AI Institute. The AI Dossier: 80+ AI Use Cases: A Collection of New, High-Impact AI Use Cases Organized by Industry, Enterprise Function, and AI Type, 2025. Retrieved December 15, 2025 from https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/ai-use-cases.html
work page 2025
-
[12]
Next-Gen Controllership: AI and Emerging Tech’s Impact on Finance
Deloitte Center for Controllership. Next-Gen Controllership: AI and Emerging Tech’s Impact on Finance. Retrieved December 16, 2025 from https://www.deloitte.com/content/dam/assets-zone3/us/en/docs/services/consulting/2025/agentic-ai-dbriefs-poll-results-deck.pdf
work page 2025
-
[14]
Gen AI’s Productivity Promise: Huge Potential but Most Have Not Yet Reached Scaled Impact
Marie El Hoyek, Nicolai Müller, and Jonas Ronellenfitsch. Gen AI’s Productivity Promise: Huge Potential but Most Have Not Yet Reached Scaled Impact. McKinsey & Company. Retrieved December 16, 2025 from https://www.mckinsey.com/capabilities/operations/our-insights/operations-blog/gen-ais-productivity-promise-huge-potential-but-most-have-not-yet-reached-sc...
work page 2025
-
[15]
Commission Launches Two Strategies to Speed Up AI Uptake in European Industry and Science
European Union. Commission Launches Two Strategies to Speed Up AI Uptake in European Industry and Science. Retrieved December 16, 2025 from https://ec.europa.eu/commission/presscorner/detail/en/ip_25_2299, 2025
work page 2025
-
[16]
Empowering Communities: The Impact of Financial Institutions on Economic Growth
Hannah Fischer-Lauder. Empowering Communities: The Impact of Financial Institutions on Economic Growth. Retrieved January 6, 2025 from https://impakter.com/empowering-communities-the-impact-of-financial-institutions-on-economic-growth/, 2025
work page 2025
-
[17]
Lindsey Gailmard, Drew Spence, Christie Lawrence, and Daniel E. Ho. Known Unknowns and Unknown Unknowns: Designing a Scalable Adverse Event Reporting System for AI. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1004–1017, 2025
work page 2025
-
[18]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv preprint arXiv:2209.07858, 2022
work page internal anchor Pith review arXiv 2022
-
[19]
1,001 Real-World Gen AI Use Cases From the World’s Leading Organizations
Google. 1,001 Real-World Gen AI Use Cases From the World’s Leading Organizations. Retrieved December 15, 2025 from https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders, 2025
work page 2025
-
[20]
AI Strategy for the Federal Public Service 2025-2027
Government of Canada. AI Strategy for the Federal Public Service 2025-2027. Retrieved December 16, 2025 from https://www.canada.ca/en/government/system/digital-government/digital-government-innovations/responsible-use-ai/gc-ai-strategy-priority-areas.html, 2025
work page 2025
-
[21]
Government of India. Transforming India With AI. Retrieved December 16, 2025 from https://www.pib.gov.in/PressReleasePage.aspx?PRID=2178092&reg=3&lang=2, 2025
work page 2025
-
[22]
ISO. ISO 9241-11:2018 Ergonomics of Human-System Interaction — Part 11: Usability: Definitions and Concepts, 2018
work page 2018
-
[23]
ISO. ISO 9241-210:2019 Ergonomics of Human-System Interaction — Part 210: Human-Centred Design for Interactive Systems, 2019
work page 2019
-
[24]
ISO/IEC 42001:2023 Information Technology — Artificial Intelligence — Management System, 2023
ISO/IEC. ISO/IEC 42001:2023 Information Technology — Artificial Intelligence — Management System, 2023
work page 2023
-
[25]
ISO/IEC TR 24030:2024 Information Technology – Artificial Intelligence (AI) – Use Cases, 2024
ISO/IEC. ISO/IEC TR 24030:2024 Information Technology – Artificial Intelligence (AI) – Use Cases, 2024
work page 2024
-
[26]
Human + AI: Redefining the Standard of Care in Medicine
Johns Hopkins, Malone Center for Engineering in Healthcare. Human + AI: Redefining the Standard of Care in Medicine. The 9th Annual Johns Hopkins Research Symposium on Engineering in Healthcare. https://malonecenter.jhu.edu/johns-hopkins-malone-center-2025-symposium/, 2025
work page 2025
-
[27]
Giulia Karanxha and Paulinus Ofem. Evaluating Transparency in the Development of Artificial Intelligence Systems: A Systematic Literature Review. International Journal of Advanced Computer Science & Applications, 16(10), 2025
work page 2025
-
[28]
KPMG Global AI in Finance Report, 2024
KPMG. KPMG Global AI in Finance Report, 2024. Retrieved December 16, 2025 from https://assets.kpmg.com/content/dam/kpmgsites/xx/pdf/2024/11/ai-in-finance.pdf.coredownload.inline.pdf
work page 2024
-
[29]
Hao-Ping Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–22, 2025
work page 2025
-
[30]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, et al. Holistic Evaluation of Language Models. arXiv preprint arXiv:2211.09110, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
The AI Index 2025 Annual Report
Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, et al. The AI Index 2025 Annual Report. arXiv preprint arXiv:2504.07139, 2025. AI Index Steering Committee, Institute for Human-Centered AI, Stanford University
-
[32]
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249, 2024
work page internal anchor Pith review arXiv 2024
-
[33]
Anna Milley. Financial Markets: The Backbone of the Global Economy. Journal of Stock & Forex Trading, 12:292, 2025
work page 2025
-
[34]
How Artificial Intelligence Impacts the US Labor Market
Seb Murray. How Artificial Intelligence Impacts the US Labor Market. MIT Sloan School of Management. Retrieved December 16, 2025 from https://mitsloan.mit.edu/ideas-made-to-matter/how-artificial-intelligence-impacts-us-labor-market, 2025
work page 2025
-
[35]
A Review of Evaluation Metrics in Machine Learning Algorithms
Gireen Naidu, Tranos Zuva, and Elias Mmbongeni Sibanda. A Review of Evaluation Metrics in Machine Learning Algorithms. In Computer Science On-line Conference, pages 15–25. Springer International Publishing, 2023
work page 2023
-
[36]
Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023
NIST AI 100-1. Artificial Intelligence Risk Management Framework (AI RMF 1.0), 2023
work page 2023
-
[37]
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, 2024
NIST AI 600-1. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, 2024
work page 2024
-
[38]
Assessing Risks and Impacts of AI (ARIA): Pilot Evaluation Report, 2025
NIST AI 700-2. Assessing Risks and Impacts of AI (ARIA): Pilot Evaluation Report, 2025
work page 2025
-
[39]
OECD. OECD AI Principles. Retrieved December 16, 2025 from https://oecd.ai/en/ai-principles, 2025
work page 2025
-
[40]
Catalogue of Tools & Metrics for Trustworthy AI, List of Metric Use Cases
OECD.AI. Catalogue of Tools & Metrics for Trustworthy AI, List of Metric Use Cases. Retrieved December 16, 2025 from https://oecd.ai/en/catalogue/metric-use-cases, 2025
work page 2025
-
[41]
Digital Transformations & Tech Adoption by Sector (2025)
Levi Olmstead. Digital Transformations & Tech Adoption by Sector (2025). Retrieved January 06, 2026 from https://whatfix.com/blog/digital-transformation-by-sector/, 2025
work page 2025
-
[42]
David Powers. Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies, 2(1):37–63, 2011
work page 2011
-
[43]
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1776–1826, 2022
work page 2022
-
[44]
A Sociotechnical Audit: Assessing Police Use of Facial Recognition
Evani Radiya-Dixit and Gina Neff. A Sociotechnical Audit: Assessing Police Use of Facial Recognition. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1334–1346, 2023
work page 2023
-
[45]
Kevin Roose. A.I. Has a Measurement Problem. The New York Times, 2024. Retrieved January 6, 2026 from https://www.nytimes.com/2024/04/15/technology/ai-models-measurement.html
work page 2026
-
[46]
Disclosure Without Engagement: An Empirical Review of Positionality Statements at FAccT
Hope Schroeder, Akshansh Pareek, and Solon Barocas. Disclosure Without Engagement: An Empirical Review of Positionality Statements at FAccT. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 1195–1210, 2025
work page 2025
-
[47]
Andrew D. Selbst, danah boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the 2019 ACM Conference on Fairness, Accountability, and Transparency, pages 59–68, 2019
work page 2019
-
[48]
Generative AI in the Wild: Prospects, Challenges, and Strategies
Yuan Sun, Eunchae Jang, Fenglong Ma, and Ting Wang. Generative AI in the Wild: Prospects, Challenges, and Strategies. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1–16, 2024
work page 2024
-
[49]
U.S. Census. Business Trends and Outlook Survey (BTOS) Key Performance Indicators. Retrieved December 16, 2025 from https://www.census.gov/hfp/btos/data, 2025
work page 2025
-
[50]
U.S. Chamber of Commerce. Empowering Small Business: The Impact of Technology on U.S. Small Business, 4th Ed., 2025. Retrieved December 16, 2025 from https://www.uschamber.com/assets/documents/20251621-CTEC-Empowering-Small-Business-Report-2025-v1-r10-Digital-FINAL.pdf
work page 2025
-
[51]
Chief Information Officers Council
U.S. Chief Information Officers Council. 2024 Federal AI Use Case Inventory. Retrieved December 15, 2025 from https://www.cio.gov/policies-and-priorities/Executive-Order-13960-AI-Use-Case-Inventories-Reference/, 2024
work page 2024
-
[52]
Government Accountability Office
U.S. Government Accountability Office. Generative AI Use and Management at Federal Agencies, 2025. Accessed April 27, 2026 from https://www.gao.gov/products/gao-25-107653
work page 2025
-
[53]
The White House. Winning the Race: America’s AI Action Plan, 2025. Retrieved December 16, 2025 from https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf
work page 2025
-
[54]
U.S. Treasury. Artificial Intelligence in Financial Services: Report on the Uses, Opportunities, and Risks of Artificial Intelligence in the Financial Services Sector, 2024. Retrieved December 16, 2025 from https://home.treasury.gov/system/files/136/Artificial-Intelligence-in-Financial-Services.pdf
work page 2024
-
[55]
Peter M. VanNostrand, Dennis M. Hofmann, Lei Ma, and Elke A. Rundensteiner. Actionable Recourse for Automated Decisions: Examining the Effects of Counterfactual Explanation Type and Presentation on Lay User Understanding. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1682–1700, 2024
work page 2024
-
[56]
Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing Judges With Juries: Evaluating LLM Generations With a Panel of Diverse Models. arXiv preprint arXiv:2404.18796, 2024
-
[57]
Guy H. Walker, Neville A. Stanton, Paul M. Salmon, and Daniel P. Jenkins. A Review of Sociotechnical Systems Theory: A Classic Concept for New Command and Control Paradigms. Theoretical Issues in Ergonomics Science, 9(6):479–499, 2008
work page 2008
-
[58]
Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, et al. Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. arXiv preprint arXiv:2502.00561, 2025
-
[59]
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How Does LLM Safety Training Fail? In Advances in Neural Information Processing Systems, volume 36, pages 80079–80110, 2023
work page 2023
-
[60]
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, et al. MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024
work page 2024
-
[61]
AIR-Bench 2024: A Safety Benchmark Based on Regulation and Policies Specified Risk Categories
Yi Zeng, Yu Yang, Andy Zou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. AIR-Bench 2024: A Safety Benchmark Based on Regulation and Policies Specified Risk Categories. In The Thirteenth International Conference on Learning Representations, 2025
work page 2024
-
[62]
Judging LLM-as-a-Judge With MT-bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, et al. Judging LLM-as-a-Judge With MT-bench and Chatbot Arena. In Advances in Neural Information Processing Systems, volume 36, pages 46595–46623, 2023
work page 2023
-
[63]
Andy Zou, Maxwell Lin, Eliot Jones, Micha Nowak, Mateusz Dziemian, Nick Winter, Alexander Grattan, et al. Security Challenges in AI Agent Deployment: Insights From a Large Scale Public Competition. arXiv preprint arXiv:2507.20526, 2025
discussion (0)