pith. machine review for the scientific record.

arxiv: 2511.05501 · v3 · submitted 2025-09-30 · 💻 cs.HC · cs.AI

Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners

Pith reviewed 2026-05-18 10:56 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords generative AI, benchmarks, human-centered design, journalism, ecological validity, construct validity, evaluation cookbook, domain-centered

The pith

A human-centered design process with journalism professionals produces a contextualized evaluation cookbook for generative AI benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve the real-world relevance of generative AI benchmarks by involving domain experts in their creation. Through a workshop with 23 journalism professionals, the authors identify specific challenges in creating evaluations that reflect actual journalistic tasks and values. This leads to an evaluation structure that practitioners can use and general design requirements for contextualized, value-aligned assessments. Sympathetic readers would care because existing benchmarks often lack ecological and construct validity, misrepresenting AI capabilities in practical settings. If successful, this approach could help domain users better understand and evaluate AI tools in their work.

Core claim

By engaging journalism professionals in a design workshop, the authors develop a domain-oriented evaluation cookbook that addresses challenges in translating tasks to constructs, aligning metrics with values, and balancing stakeholder needs, thereby laying out requirements for AI evaluations that are contextualized, value-aligned, and that cultivate evaluative literacy for end-users.

What carries the argument

The domain-oriented evaluation cookbook, derived from workshop findings on domain-specific challenges and tensions in benchmark design.

If this is right

  • Journalism practitioners receive a practical structure to experiment with AI evaluations.
  • AI benchmarks gain better alignment with domain-specific values and real-world usages.
  • Evaluations promote evaluative literacy among domain end-users.
  • General design requirements emerge for creating contextualized AI assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar design-based approaches could be tested in other professional domains to create tailored benchmarks.
  • This work implies that purely technical or general benchmarks may overlook critical domain tensions.
  • Broader sampling and validation studies would strengthen the generalizability of the cookbook.

Load-bearing premise

The assumption that findings from a single workshop with 23 self-selected journalism professionals are sufficient to surface generalizable domain-specific challenges and ground a reusable evaluation cookbook.

What would settle it

A larger, more diverse study in which journalists test the cookbook and find that it does not improve the relevance or usability of AI evaluations would falsify the claim.

Figures

Figures reproduced from arXiv: 2511.05501 by Charlotte Li, Jeremy Gilbert, Nick Diakopoulos, Nick Hagar, Sachita Nishal.

Figure 1: Google Colaboratory Notebooks are modular and allow us to interweave markdown text and code explaining rationales
Figure 2: Task Context is presented in a table where categories

original abstract

Benchmarks play a significant role in how technology companies communicate about model capabilities and how researchers and the public understand generative AI systems. However, existing benchmarks have been criticized for their failure to adequately capture real-world usages (i.e. ecological validity) or to measure underlying concepts (i.e. construct validity). Building on approaches in HCI, we adopt a human-centered design process to address such critiques. Working within the journalism domain we engaged 23 professionals in a workshop which informed the design of a domain-oriented evaluation "cookbook". Our workshop findings surface domain-specific challenges and tensions faced by designers in translating specific tasks to evaluation constructs, aligning metrics with domain-specific values, and balancing needs among different stakeholders when constructing evaluations. Through an instantiation of design-based approaches for benchmark creation in the journalism domain, this work not only produces an evaluation structure for journalism practitioners to experiment with, but also lays out design requirements for AI evaluations that are contextualized, value-aligned, and cultivate evaluative literacy for domain end-users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that existing generative AI benchmarks lack ecological and construct validity for real-world use, and that a human-centered design process—specifically a workshop with 23 journalism professionals—can surface domain-specific challenges (task-to-construct translation, value alignment, and stakeholder balancing) to inform the creation of a reusable 'evaluation cookbook' that supplies design requirements for contextualized, value-aligned evaluations cultivating evaluative literacy among domain end-users.

Significance. If the design requirements prove transferable, the work offers a concrete template for improving benchmark validity in applied domains by grounding evaluation design in practitioner input rather than generic metrics. This aligns with HCI traditions of participatory and design-based methods and could help address documented critiques of current AI evaluation practices.

major comments (1)
  1. [Workshop Findings / Method] The central claim that workshop outputs ground reusable design requirements for contextualized and value-aligned evaluations rests on findings from a single workshop with 23 self-selected participants. The abstract and method description provide no details on recruitment strategy, participant diversity, or follow-up validation, leaving open whether the surfaced tensions reflect stable journalism-domain features or group-specific priorities; this directly limits support for the 'cookbook' as a generalizable contribution.
minor comments (2)
  1. [Abstract] The abstract introduces the 'evaluation cookbook' without a concise description of its structure or concrete examples of how workshop insights were translated into specific requirements; a short illustrative example would aid reader comprehension.
  2. [Related Work] The manuscript would benefit from explicit positioning against prior HCI work on ecological validity and benchmark co-design (e.g., references to participatory evaluation frameworks) to clarify the incremental contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for acknowledging the potential significance of grounding AI benchmark design in practitioner input. We address the major methodological concern below and will revise the manuscript to increase transparency while clarifying the scope of our claims.

point-by-point responses
  1. Referee: [Workshop Findings / Method] The central claim that workshop outputs ground reusable design requirements for contextualized and value-aligned evaluations rests on findings from a single workshop with 23 self-selected participants. The abstract and method description provide no details on recruitment strategy, participant diversity, or follow-up validation, leaving open whether the surfaced tensions reflect stable journalism-domain features or group-specific priorities; this directly limits support for the 'cookbook' as a generalizable contribution.

    Authors: We agree that the manuscript would benefit from greater detail on recruitment and participant characteristics. In the revision we will expand the Methods section to describe the recruitment approach (outreach via journalism professional networks and associations), include a summary of participant roles (e.g., reporters, editors, data journalists), experience levels, and organizational contexts, and add a demographics table. We will also explicitly note that this was a single exploratory workshop without a separate follow-up validation study. We will revise the framing to present the evaluation cookbook and design requirements as an initial, domain-grounded prototype and transferable process rather than a fully validated general instrument. This will better align the claims with the evidence while preserving the contribution as a concrete HCI-informed example for other applied domains.

    revision: yes

Circularity Check

0 steps flagged

No circularity; claims derived from external workshop observations

full rationale

The paper derives its domain-specific challenges, tensions, and design requirements for contextualized AI evaluations directly from the findings of a single workshop with 23 journalism professionals. This is an external empirical input collected via human-centered design methods rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, ansatzes, uniqueness theorems, or renamings of prior results appear in the derivation; the evaluation cookbook and requirements are presented as outputs of the workshop process itself. The work is therefore self-contained against external benchmarks with no reduction of claims to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the premise that a human-centered workshop can reliably surface transferable domain challenges and that the resulting cookbook will improve validity; no numerical free parameters are introduced, and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption: Engaging domain professionals in a design workshop produces evaluation constructs that better capture ecological and construct validity than generic benchmarks.
    This premise is invoked when the authors move from workshop findings to the claim that the cookbook addresses real-world validity critiques.
invented entities (1)
  • evaluation cookbook (no independent evidence)
    purpose: A structured set of guidelines and templates that journalism practitioners can use to design their own AI evaluations.
    The cookbook is presented as a new practical artifact emerging from the workshop; no independent evidence outside the paper is provided for its effectiveness.

pith-pipeline@v0.9.0 · 5716 in / 1386 out tokens · 30198 ms · 2026-05-18T10:56:33.395414+00:00 · methodology

