Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
Pith reviewed 2026-05-18 10:56 UTC · model grok-4.3
The pith
A human-centered design process with journalism professionals produces a contextualized evaluation cookbook for generative AI benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By engaging journalism professionals in a design workshop, the authors develop a domain-oriented evaluation cookbook that addresses three challenges: translating tasks to constructs, aligning metrics with values, and balancing stakeholder needs. From this they derive requirements for AI evaluations that are contextualized, value-aligned, and able to cultivate evaluative literacy among domain end-users.
What carries the argument
The domain-oriented evaluation cookbook, derived from workshop findings on domain-specific challenges and tensions in benchmark design.
If this is right
- Journalism practitioners receive a practical structure to experiment with AI evaluations.
- AI benchmarks gain better alignment with domain-specific values and real-world use.
- Evaluations promote evaluative literacy among domain end-users.
- General design requirements emerge for creating contextualized AI assessments.
Where Pith is reading between the lines
- Similar design-based approaches could be tested in other professional domains to create tailored benchmarks.
- This work implies that purely technical or general benchmarks may overlook critical domain tensions.
- Broader sampling and validation studies would strengthen the generalizability of the cookbook.
Load-bearing premise
The assumption that findings from a single workshop with 23 self-selected journalism professionals are sufficient to surface generalizable domain-specific challenges and ground a reusable evaluation cookbook.
What would settle it
A larger, more diverse study in which journalists test the cookbook and find that it does not improve the relevance or usability of AI evaluations would falsify the claim.
Original abstract
Benchmarks play a significant role in how technology companies communicate about model capabilities and how researchers and the public understand generative AI systems. However, existing benchmarks have been criticized for their failure to adequately capture real-world usages (i.e., ecological validity) or to measure underlying concepts (i.e., construct validity). Building on approaches in HCI, we adopt a human-centered design process to address such critiques. Working within the journalism domain, we engaged 23 professionals in a workshop which informed the design of a domain-oriented evaluation "cookbook". Our workshop findings surface domain-specific challenges and tensions faced by designers in translating specific tasks to evaluation constructs, aligning metrics with domain-specific values, and balancing needs among different stakeholders when constructing evaluations. Through an instantiation of design-based approaches for benchmark creation in the journalism domain, this work not only produces an evaluation structure for journalism practitioners to experiment with, but also lays out design requirements for AI evaluations that are contextualized, value-aligned, and cultivate evaluative literacy for domain end-users.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that existing generative AI benchmarks lack ecological and construct validity for real-world use, and that a human-centered design process, specifically a workshop with 23 journalism professionals, can surface domain-specific challenges (task-to-construct translation, value alignment, and stakeholder balancing). These findings inform a reusable 'evaluation cookbook' that supplies design requirements for contextualized, value-aligned evaluations that cultivate evaluative literacy among domain end-users.
Significance. If the design requirements prove transferable, the work offers a concrete template for improving benchmark validity in applied domains by grounding evaluation design in practitioner input rather than generic metrics. This aligns with HCI traditions of participatory and design-based methods and could help address documented critiques of current AI evaluation practices.
major comments (1)
- [Workshop Findings / Method] The central claim that workshop outputs ground reusable design requirements for contextualized and value-aligned evaluations rests on findings from a single workshop with 23 self-selected participants. The abstract and method description provide no details on recruitment strategy, participant diversity, or follow-up validation, leaving open whether the surfaced tensions reflect stable journalism-domain features or group-specific priorities; this directly limits support for the 'cookbook' as a generalizable contribution.
minor comments (2)
- [Abstract] The abstract introduces the 'evaluation cookbook' without a concise description of its structure or concrete examples of how workshop insights were translated into specific requirements; a short illustrative example would aid reader comprehension.
- [Related Work] The manuscript would benefit from explicit positioning against prior HCI work on ecological validity and benchmark co-design (e.g., references to participatory evaluation frameworks) to clarify the incremental contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for acknowledging the potential significance of grounding AI benchmark design in practitioner input. We address the major methodological concern below and will revise the manuscript to increase transparency while clarifying the scope of our claims.
Point-by-point responses
- Referee: [Workshop Findings / Method] The central claim that workshop outputs ground reusable design requirements for contextualized and value-aligned evaluations rests on findings from a single workshop with 23 self-selected participants. The abstract and method description provide no details on recruitment strategy, participant diversity, or follow-up validation, leaving open whether the surfaced tensions reflect stable journalism-domain features or group-specific priorities; this directly limits support for the 'cookbook' as a generalizable contribution.
  Authors: We agree that the manuscript would benefit from greater detail on recruitment and participant characteristics. In the revision we will expand the Methods section to describe the recruitment approach (outreach via journalism professional networks and associations), include a summary of participant roles (e.g., reporters, editors, data journalists), experience levels, and organizational contexts, and add a demographics table. We will also explicitly note that this was a single exploratory workshop without a separate follow-up validation study. We will revise the framing to present the evaluation cookbook and design requirements as an initial, domain-grounded prototype and transferable process rather than a fully validated general instrument. This will better align the claims with the evidence while preserving the contribution as a concrete HCI-informed example for other applied domains.
  Revision: yes
Circularity Check
No circularity; claims derived from external workshop observations
The paper derives its domain-specific challenges, tensions, and design requirements for contextualized AI evaluations directly from the findings of a single workshop with 23 journalism professionals. This is an external empirical input collected via human-centered design methods rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. No equations, ansatzes, uniqueness theorems, or renamings of prior results appear in the derivation; the evaluation cookbook and requirements are presented as outputs of the workshop process itself. The work therefore rests on external input, and none of its claims reduce to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Engaging domain professionals in a design workshop produces evaluation constructs that better capture ecological and construct validity than generic benchmarks.
invented entities (1)
- evaluation cookbook: no independent evidence