pith. machine review for the scientific record.

arxiv: 2604.02406 · v1 · submitted 2026-04-02 · 💻 cs.CY

Recognition: 2 theorem links · Lean Theorem

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:57 UTC · model grok-4.3

classification 💻 cs.CY
keywords cultural appropriateness · AI-generated images · community engagement · text-to-image models · measurement instruments · LLM-as-a-judge · cultural artifacts · systematization

The pith

Communities define what counts as culturally appropriate in AI-generated images of their artifacts before automation begins.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how to gather input from three communities to turn the abstract idea of cultural appropriateness into precise, systematized definitions for evaluating text-to-image models. Because engagement is concentrated in this early systematization stage, the resulting concepts directly encode how people interact with their artifacts and how they want them depicted. The authors then explore converting these concepts into repeatable rubrics that multimodal LLMs can apply automatically across different models and settings. The work highlights both the validity gains from community input and the remaining difficulty of preserving that input through automation.

Core claim

Systematized concepts of cultural appropriateness derived from community members' lived experiences with artifacts and their preferences for depiction provide valid measures that reflect authentic perspectives, and these can be operationalized into automated instruments using a multimodal LLM-as-a-judge method while retaining the value of the original community input.

What carries the argument

The three-stage measurement process that concentrates community input in the systematization phase to create precise concepts before they are turned into concrete rubrics for automated application.
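A minimal sketch of how those three stages could fit together in code, assuming a rubric of binary criteria grouped under themes. The guide-cane criteria below are paraphrased from the paper's Figure 4 description; every name and structure here is illustrative, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One checkable visual feature elicited from community members."""
    cid: str          # e.g., "T1.C2" (hypothetical ID scheme)
    theme: str        # the broader representational desire it serves
    description: str  # precise statement an annotator can verify in an image

# Systematization: community workshops yield precise, checkable criteria.
guide_cane_rubric = [
    Criterion("T1.C1", "accurate depiction", "The cane is long, thin, and mostly white."),
    Criterion("T1.C2", "accurate depiction", "The cane has a straight handle."),
    Criterion("T2.C1", "respectful use", "The cane is held diagonally across the body."),
]

# Operationalization: each criterion becomes one yes/no question for an MLLM judge.
def to_judge_questions(rubric: list[Criterion]) -> list[str]:
    return [f"{c.cid}: Does the image satisfy: {c.description} Answer yes or no."
            for c in rubric]

# Application: the same instrument runs unchanged over images from any
# text-to-image model, which is what makes the measurement repeatable.
```

The point of the sketch is that all of the community's judgment lives in the `description` strings; everything downstream is mechanical, which is exactly why fidelity losses at the operationalization step matter.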

If this is right

  • Measurement instruments for AI image generation can be made repeatable and automatable across models while remaining grounded in community expertise.
  • Early community involvement in defining concepts reduces the risk that automated evaluations miss key cultural concerns.
  • Case-study rubrics from specific groups, such as blind and low vision UK residents or communities in Kerala and Tamil Nadu, can guide evaluation of how models depict material culture.
  • The systematization step creates a reusable foundation for applying the same measures to new datasets or evaluation settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-community systematization approach could apply to measuring other AI harms involving cultural or identity representation.
  • Hybrid human-plus-LLM pipelines may be needed to handle cases where full automation loses nuance from the original community definitions.
  • Standardized community-informed rubrics could support cross-model benchmarks that track improvement in cultural depiction over time.

Load-bearing premise

Community perspectives on cultural appropriateness can be translated into structured rubrics without losing essential expertise and lived experience.

What would settle it

Direct comparison showing that community members rate the same set of AI-generated artifact images differently from an LLM judge applying the community-derived rubrics.
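One concrete shape such a comparison could take, mirroring the per-criterion agreement rates in Figure 5: treat each (criterion, image) annotation as an item and compute how often the MLLM judge matches the community label. A sketch under an assumed data layout, not code from the paper.

```python
from collections import defaultdict

def per_criterion_agreement(annotations):
    """annotations: iterable of (criterion_id, human_label, mllm_label),
    where the labels are booleans meaning 'criterion satisfied'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for cid, human, mllm in annotations:
        totals[cid] += 1
        hits[cid] += int(human == mllm)
    return {cid: hits[cid] / totals[cid] for cid in totals}
```

Reporting agreement per criterion rather than averaged over the whole rubric matters here: a judge near chance on one binary criterion (as with the 0.46 agreement rate on the drum-head material in Figure 5) would be invisible in a rubric-level mean.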

Figures

Figures reproduced from arXiv: 2604.02406 by Anja Thieme, Cecily Morrison, Daniela Massiceti, Deepthi Sudharsan, Hamna, Hoda Heidari, Jennifer Wortman Vaughan, Nari Johnson, Samantha Dalal, Theo Holroyd.

Figure 1: Scaffolding community engagement to develop community-centered measures of cultural representation. Given an input prompt (e.g., "a photo of a guide cane"), we invited community members to participate in designing a rubric that captures their expertise and preferences for each cultural artifact (systematization). Our research team then explored the use of this rubric within an automated multimodal LLM-as-a…

Figure 2: Measurement framework from the social sciences [1, 116]. Our research studies how to center community knowledge in the systematization process before operationalizing the systematized concept as an automated MLLM-as-a-judge system.

Figure 3: Selected culturally significant artifacts. From right to left: (1) With the blind and low vision community, we selected a guide cane (a mobility aid that is held diagonally across one's body) and a braille notetaker (an electronic device that can be used to read and write notes in tactile braille). (2) With residents of Tamil Nadu, we selected Pallanguzhi (a two-player mancala game where players compete to…

Figure 4: A rubric to score images of a guide cane, designed with BLV community members. Criteria that correspond to visual features in images are organized under two themes that describe participants' desires for cultural representation.

Figure 5: Human-MLLM judge alignment for individual rubric criteria. A histogram of the human-MLLM agreement rate for individual rubric criteria. There is high variance across criteria in the MLLM's ability to annotate a criterion accurately; in the example criterion on the left, GPT-4o has low accuracy (agreement rate 0.46) at annotating whether a drum's head is made of the correct…

Figure 6: Community-elicited rubrics differ meaningfully from those generated by off-the-shelf LLMs. Our rubrics differ from LLM-generated rubrics in three ways, each illustrated using an example (Appendix B.1.3). First, LLM-generated rubrics can include factual or interpretive errors that reflect misunderstandings of the artifact (e.g., whether a braille notetaker should have a screen). Second, our rubrics provide…

Figure 7: Annotated LLM-generated rubric for a guide cane. While generally providing an accurate description of a guide cane, the rubric misses several key details. It does not provide a complete description of the straight handle shape of a cane (C2), a feature of critical importance to the community. In workshops, we learned that a band of red tape on a cane's body is often a visual signifier that…

Figure 8: Annotated LLM-generated rubric for a braille notetaker. The rubric criteria include inaccurate descriptions of a braille notetaking device and omit descriptive details about valid depictions of braille. A braille notetaker does not resemble a notepad (e.g., it does not include a writing device such as a pen); it instead resembles a slim rectangular box (C1). The rubric does not provide a d…

Figure 9: Annotated LLM-generated rubric for Pallanguzhi. The rubric generally provides an accurate description of the most important characteristics of a Pallanguzhi board, with two differences from the community-elicited rubric. (C1) Community members clarified that the color of the wood is important, and that Pallanguzhi boards are traditionally made of a deep-brown teakwood. (C2) The number of pits in each r…

Figure 10: Annotated LLM-generated rubric for a Mridangam. The rubric lacks many of the critical details that distinguish the Mridangam from related drums and percussion instruments. One significant omission is the black circular membrane that must be present on both drumheads, a key feature that contributes to the timbre of the drum (C5). One drumhead is often slightly larger than the other (C2). The Mridangam shou…

Figure 11: Annotated LLM-generated rubric for a Kasavu saree. The rubric generally provides an accurate description of a Kasavu saree, demonstrating substantial overlap with the community-elicited rubric. However, community members were clear that the material must be cotton and not silk (C4).

Figure 12: Annotated LLM-generated rubric for Chundan Vallam. The rubric criteria cover the general structure of the Chundan Vallam but do not specify its defining features. In particular, they omit details about the oar structure and handling (C2); community members specified that the oars should be long, angled downward toward the water, and that each oarsman must use a single oar. The rubrics also do not specify…

Figure 13: Criterion-level annotations provided by humans reveal the specific representational errors that make depictions of a braille notetaker inappropriate. The figure displays a reference photo of a braille notetaker and example AI-generated images that fall into one of four groups (as annotated by humans): (1) images that are appropriate to show (no filter-out criteria are met), (2) images that do not mee…

Figure 14: Comparing (manual) rubric application across models for a braille notetaker. The frequency at which different criteria are violated (reported here using annotations provided by humans) varies across models. For example, the GPT Image-1 images of braille notetakers that are inappropriate to show all violate Theme 1, Criterion 4 (failing to depict valid braille). In contrast, images generated b…

Figure 15: Comparing (manual) rubric application across models for a Mridangam. Comparing the frequency at which different criteria are violated across models allows practitioners to draw interpretable insights about models' failure modes. With the exception of GPT Image-1, many of the models (i.e., DALL·E 3, Flux.1 DEV, and Stable Diffusion 3 Medium) consistently fail to meet several criteria, such as failing to de…

Figure 16: Comparing (manual) rubric application across models for a Kasavu saree. Visualizing the breakdown of criteria violated by different image generation models reveals interpretable insights about model behavior. For example, only Flux.1 and the Stable Diffusion models depict the saree with additional unnecessary embellishment (Theme 1, Criterion 4). The GPT Image-1 images consistently depict the sare…

Figure 17: Stable Diffusion 3 generates depictions of unrelated cultural artifacts and scenes when given simple transliterated prompts. Depictions improve when images are generated using DALL·E 3 revised prompts instead. The images on the left were generated with the simple transliterated prompt "A photo of a Chundan Vallam". Instead of producing depictions of a boat, the generated images show unrelated depictions o…
original abstract

Measurement is essential to improving AI performance and mitigating harms for marginalized groups. As generative AI systems are rapidly deployed across geographies and contexts, AI measurement practices must be designed to support repeatable, automatable application across different models, datasets, and evaluation settings. But the drive to automate measurement can be in tension with the ability for measurement instruments to capture the expertise and perspectives of communities impacted by AI. Recent work advocates for breaking measurement into several key stages: first moving from an abstract concept to be measured into a precise, "systematized" concept; next operationalizing the systematized concept into a concrete measurement instrument; and finally applying the measurement instrument on data to produce measurements. This opens up an opportunity to concentrate community engagement in the systematization phase before operationalizing and applying measurement instruments. In this paper, we explore how to involve communities in systematizing the concept of "cultural appropriateness" in text-to-image models' representation of culturally significant artifacts through case studies with three communities: blind and low vision individuals residing in the UK, residents of Kerala, and residents of Tamil Nadu. Our systematized concepts reflect community members' lived experiences interacting with each artifact and how they want their material culture to be depicted, demonstrating the value of community involvement in defining valid measures. We explore how these systematized concepts can be operationalized into automated measurement instruments that could be applied using a multimodal LLM-as-a-judge approach and challenges that remain. We reflect on the benefits and limitations of such approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes a framework for involving communities in systematizing the concept of cultural appropriateness for evaluating text-to-image model outputs depicting cultural artifacts. It presents three case studies (blind and low-vision UK residents, Kerala residents, Tamil Nadu residents) in which community input shapes systematized concepts grounded in lived experiences and desired depictions of material culture. The work then explores operationalizing these concepts into automated multimodal LLM-as-a-judge instruments and reflects on benefits, limitations, and remaining challenges of such automation.

Significance. If the operationalization step can be shown to preserve community perspectives, the approach would offer a practical route to repeatable, community-informed metrics for cultural representation in generative AI, addressing tensions between automation and expertise capture in measurement design.

major comments (2)
  1. The central claim that community involvement in systematization produces valid measures rests on the untested assumption that these concepts can be operationalized into LLM-as-a-judge instruments without substantial loss of fidelity. No quantitative agreement study (e.g., side-by-side ratings by community members versus the LLM on identical generated images) or validation of prompt construction against community judgments is reported, leaving the transition from qualitative systematization to automated application unsupported by evidence.
  2. Details on how the systematized concepts were translated into concrete LLM prompts or rubrics are absent, including any iterative refinement process or checks for consistency with original community input. This omission makes it impossible to evaluate whether the operationalized instruments accurately reflect the lived-experience grounding described in the case studies.
minor comments (1)
  1. The abstract and introduction would benefit from clearer demarcation between the qualitative case-study contributions and the exploratory operationalization discussion to help readers distinguish established findings from proposed future directions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity on the operationalization of community-informed concepts. We address each major comment below and propose targeted revisions to strengthen the manuscript without altering its core exploratory focus on community-driven systematization.

point-by-point responses
  1. Referee: The central claim that community involvement in systematization produces valid measures rests on the untested assumption that these concepts can be operationalized into LLM-as-a-judge instruments without substantial loss of fidelity. No quantitative agreement study (e.g., side-by-side ratings by community members versus the LLM on identical generated images) or validation of prompt construction against community judgments is reported, leaving the transition from qualitative systematization to automated application unsupported by evidence.

    Authors: The manuscript's primary claim centers on the value of community involvement during the systematization phase to ground the concept of cultural appropriateness in lived experiences, rather than asserting that subsequent operationalization into LLM-as-a-judge instruments occurs without any loss of fidelity. We explicitly frame the operationalization as an exploratory step and dedicate space to reflecting on its benefits, limitations, and open challenges. No quantitative agreement study was included because the work prioritizes establishing repeatable community-informed concepts over immediate empirical validation of automation; such a study would be a natural and valuable extension but falls outside the current scope. We will revise the manuscript to more explicitly delineate these boundaries and add a dedicated future-work subsection outlining potential validation approaches. revision: partial

  2. Referee: Details on how the systematized concepts were translated into concrete LLM prompts or rubrics are absent, including any iterative refinement process or checks for consistency with original community input. This omission makes it impossible to evaluate whether the operationalized instruments accurately reflect the lived-experience grounding described in the case studies.

    Authors: We agree that additional transparency on the translation process would improve evaluability. The initial submission emphasized the community engagement and systematization stages; consequently, concrete prompt examples and refinement details were omitted. In the revised manuscript we will add an appendix containing (1) sample LLM prompts and rubrics derived directly from the three case-study systematized concepts, (2) a description of how community statements were mapped to rubric criteria, and (3) any consistency checks performed against the original participant input. This addition will allow readers to assess alignment without shifting the paper's primary emphasis. revision: yes
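To help readers picture what such an appendix entry might look like, here is one hedged guess at how rubric criteria could be assembled into an MLLM-judge prompt. The paper does not publish its prompts, so the template, function, and criteria below are illustrative only.

```python
def build_judge_prompt(artifact: str, criteria: list[tuple[str, str]]) -> str:
    """Assemble a single judge prompt from community-derived criteria.
    `criteria` holds (criterion_id, description) pairs; the image itself is
    assumed to be attached to the request by the calling code."""
    lines = [
        f"You are evaluating an AI-generated image of a {artifact}.",
        "For each numbered criterion, answer strictly 'yes' or 'no'.",
    ]
    lines += [f"{cid}: {desc}" for cid, desc in criteria]
    return "\n".join(lines)

prompt = build_judge_prompt(
    "guide cane",
    [("T1.C2", "The cane has a straight handle."),
     ("T2.C1", "The cane is held diagonally across the body.")],
)
```

The consistency checks the authors propose would then amount to verifying that each description string in such a prompt is traceable to a specific community statement.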

Circularity Check

0 steps flagged

No circularity: qualitative case studies with no derivations or fitted predictions

full rationale

The paper conducts case studies with three communities to systematize the concept of cultural appropriateness for AI-generated images of artifacts, then explores (without claiming to validate) operationalization into multimodal LLM judges. No equations, parameters, or quantitative predictions appear anywhere in the manuscript. The central claims are descriptive accounts of community input rather than any reduction of outputs to inputs by construction. No self-citation chains are load-bearing for the results, and the work contains no self-definitional steps, ansatzes smuggled via citation, or renaming of known results. This is self-contained qualitative research with no circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that community perspectives captured in the systematization stage remain valid when translated into automated scoring rules. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Community engagement concentrated in the systematization phase produces more valid measures than engagement spread across all phases.
    Stated as the key opportunity opened by the three-stage framework.

pith-pipeline@v0.9.0 · 5603 in / 1309 out tokens · 36902 ms · 2026-05-13T20:57:08.175792+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

139 extracted references · 139 canonical work pages · 2 internal anchors

  1. [1] Robert Adcock and David Collier. 2001. Measurement validity: A shared standard for qualitative and quantitative research. American Political Science Review 95, 3 (2001), 529–546.
  2. [3] Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Ashutosh Dwivedi, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, and Monojit Choudhury. 2024. Towards Measuring and Modeling "Culture" in LLMs: A Survey. arXiv preprint arXiv:2403.15412 (2024).
  3. [4] Rudaiba Adnin and Maitraye Das. 2024. "I look at it as the king of knowledge": How Blind People Use and Understand Generative AI Tools. In Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility (St. John's, NL, Canada) (ASSETS '24). Association for Computing Machinery, New York, NY, USA, Article 64, 14 pages. doi:10.11...
  4. [5] Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu...
  5. [6] Sherry R. Arnstein. 1969. A Ladder of Citizen Participation. Journal of the American Institute of Planners 35, 4 (1969), 216–224.
  6. [7] Lora Aroyo, Alex S. Taylor, Mark Díaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-García, Vinodkumar Prabhakaran, and Ding Wang.
  7. [8] DICES dataset: diversity in conversational AI evaluation for safety. In Proceedings of the 37th International Conference on Neural Information Processing Systems (New Orleans, LA, USA) (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 2321, 13 pages.
  8. [9] Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, W Duncan Wadsworth, and Hanna Wallach. 2021. Designing disaggregated evaluations of AI systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 368–378.
  9. [10] Cynthia L. Bennett, Erin Brady, and Stacy M. Branham. 2018. Interdependence as a Frame for Assistive Technology Research and Design. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (Galway, Ireland) (ASSETS '18). Association for Computing Machinery, New York, NY, USA, 161–173. doi:10.1145/3234695.3236348.
  10. [11] Cynthia L. Bennett, Shaun K. Kane, and Christina N. Harrington. 2025. Toward Community-Led Evaluations of Text-to-Image AI Representations of Disability, Health, and Accessibility. In Proceedings of the 5th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO '25). Association for Computing Machinery, New York, NY, USA, 25...
  11. [12] Stevie Bergman, Nahema Marchal, John Mellor, Shakir Mohamed, Iason Gabriel, and William Isaac. 2024. STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports 14, 1 (2024), 6616.
  12. [13] Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. 2023. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (Chicago, IL, U...
  13. [14] Asia Biega, Georgina Born, Fernando Diaz, Mary L. Gray, and Rida Qadri. 2025. Towards a Multidisciplinary Vision for Culturally Inclusive Generative AI (Dagstuhl Seminar 25022). Dagstuhl Reports 15, 1 (2025), 33–49. doi:10.4230/DagRep.15.1.33.
  14. [15] Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux
  15. [16] Janet Blake. 2000. On Defining the Cultural Heritage. International & Comparative Law Quarterly 49, 1 (2000), 61–85.
  16. [17] Emory S. Bogardus. 1942. Fundamentals of Social Psychology (3rd ed.). D. Appleton-Century Company, New York and London.
  17. [18] Stacy M. Branham and Shaun K. Kane. 2015. The Invisible Work of Accessibility: How Blind Employees Manage Accessibility in Mixed-Ability Workplaces. In Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility (Lisbon, Portugal) (ASSETS '15). Association for Computing Machinery, New York, NY, USA, 163–171. doi:10.1145/270064...
  18. [19] Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Philipp Koehn and Rada Mihalcea (Eds.). Association for Computational Linguistics, Singapore, 286–295. https://aclanthology.org/D09-1030/
  19. [20] Joseph Chee Chang, Saleema Amershi, and Ece Kamar. 2017. Revolt: Collaborative Crowdsourcing for Labeling Machine Learning Datasets. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI '17). Association for Computing Machinery, New York, NY, USA, 2334–2346. doi:10.1145/3025453.3026044.
  20. [21] Kyla Chasalow and Karen Levy. 2021. Representativeness in statistics, politics, and machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 77–89.
  21. [22] Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, and Golnoosh Farnadi. 2025. Neither Valid nor Reliable? Investigating the Use of LLMs as Judges. https://arxiv.org/abs/2508.18076
  22. [23] Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat, Koustuv Sinha, Melissa Hall, Michal Drozdzal, and Adriana Romero-Soriano. 2025. Multi-Modal Language Models as Text-to-Image Model Evaluators. https://arxiv.org/abs/2505.00759
  23. [24] Tim Connell. 2008. The Challenge of Assistive Technology and Braille Literacy. https://www.afb.org/aw/9/1/14277 [Online; accessed 6-September-2025].
  24. [25] Emily Corvi, Hannah Washington, Stefanie Reed, Chad Atalla, Alexandra Chouldechova, P. Alex Dow, Jean Garcia-Gathright, Nicholas J Pangakis, Emily Sheng, Dan Vann, Matthew Vogel, and Hanna Wallach. 2025. Taxonomizing Representational Harms using Speech Act Theory. In Findings of the Association for Computational Linguistics. doi:10.18653/v1/2025.findings-acl.202.
  25. [26] Amanda Coston, Anna Kawakami, Haiyi Zhu, Ken Holstein, and Hoda Heidari. 2023. A validity perspective on evaluating the justified use of data-driven decision-making algorithms. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 690–704.
  26. [27] Lee J Cronbach and Paul E Meehl. 1955. Construct validity in psychological tests. Psychological Bulletin 52, 4 (1955), 281.
  27. [28] Samantha Dalal, Siobhan Mackenzie Hall, and Nari Johnson. 2024. Provocation: Who benefits from "inclusion" in Generative AI? https://arxiv.org/abs/2411.09102
  28. [29] Maitraye Das, Alexander J Fiannaca, Meredith Ringel Morris, Shaun K Kane, and Cynthia L Bennett. 2024. From provenance to aberrations: Image creator and screen reader user perspectives on alt text for AI-generated images. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–21.
  29. [30] Maitraye Das, Darren Gergle, and Anne Marie Piper. 2019. "It doesn't win you friends": Understanding Accessibility in Collaborative Writing for People with Vision Impairments. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 191 (Nov. 2019), 26 pages. doi:10.1145/3359293.
  30. [31] Nassim Dehouche and Kullathida Dehouche. 2023. What's in a text-to-image prompt? The potential of stable diffusion in visual arts education. Heliyon 9, 6 (2023), e16757. doi:10.1016/j.heliyon.2023.e16757.
  31. [32] Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. 2023. The Participatory Turn in AI Design: Theoretical Foundations and the Current State of Practice. https://arxiv.org/abs/2310.00907
  32. [33] Sunipa Dev, Vinodkumar Prabhakaran, Rutledge Chin Feman, Aida Davani, Remi Denton, Charu Kalia, Piyawat L Kumjorn, Madhurima Maji, Rida Qadri, Negar Rostamzadeh, Renee Shelby, Romina Stella, Hayk Stepanyan, Erin van Liemt, Aishwarya Verma, Oscar Wahltinez, Edem Wornyo, Andrew Zaldivar, and Saška Mojsilović. 2026. A Unified Framework to Quantify Cultural I...
  33. [34] Athiya Deviyani and Fernando Diaz. 2025. Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy. https://arxiv.org/abs/2503.19828
  34. [35] Lisa Egede. 2025. Exploring Black Communities' Perceptions and Design Approaches for Building Culturally Tailored AI Systems. Association for Computing Machinery, New York, NY, USA, 72–76. https://doi.org/10.1145/3715668.3735629
  35. [36] Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. 2025. Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. https://arxiv.org/abs/2502.06559
  36. [37] Yannick Exner, Jochen Hartmann, Oded Netzer, and Shunyuan Zhang. 2025. AI in Disguise: How AI-Generated Ads' Visual Cues Shape Consumer Perception and Performance. doi:10.2139/ssrn.5096969.
  37. [38] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 1778–1785. doi:10.1109/CVPR.2009.5206772.
  38. [39] Sanjana Gautam, Pranav Narayanan Venkit, and Sourojit Ghosh. 2024. From melting pots to misrepresentations: Exploring harms in Generative AI. arXiv preprint arXiv:2403.10776 (2024).
  39. [40] Simret Araya Gebreegziabher, Charles Chiang, Zichu Wang, Zahra Ashktorab, Michelle Brachman, Werner Geyer, Toby Jia-Jun Li, and Diego Gómez-Zará. 2025. MetricMate: An Interactive Tool for Generating Evaluation Criteria for LLM-as-a-Judge Workflow. In Proceedings of the 4th Annual Symposium on Human-Computer Interaction for Work (CHIWORK '25). Association f...
  40. [41] Sourojit Ghosh, Pranav Narayanan Venkit, Sanjana Gautam, Shomir Wilson, and Aylin Caliskan. 2024. Do Generative AI Models Output Harm while Representing Non-Western Cultures: Evidence from A Community-Centered Approach. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society 7, 1 (Oct. 2024), 476–489. doi:10.1609/aies.v7i1.31651.
  41. [42] Tarleton Gillespie. 2024. Generative AI and the politics of visibility. Big Data & Society 11, 2 (2024), 20539517241252131. doi:10.1177/20539517241252131.
  42. [43] Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, and Alexandra Chouldechova. 2025. Validating LLM-as-a-Judge Systems under Rating Indeterminacy. https://arxiv.org/abs/2503.05965
  43. [44] Kanika Gupta, Monojit Choudhury, and Kalika Bali. 2012. Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odij...
  44. [45] Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, and Sunayana Sitaram. 2024. Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? arXiv preprint arXiv:2309.07462 (2024).
  45. [46] Melissa Hall, Samuel J. Bell, Candace Ross, Adina Williams, Michal Drozdzal, and Adriana Romero Soriano. 2024. Towards Geographic Inclusion in the Evaluation of Text-to-Image Models. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (Rio de Janeiro, Brazil) (FAccT '24). Association for Computing Machinery, New York, NY, ...
  46. [47] Stuart Hall (Ed.). 1997. Representation: Cultural Representations and Signifying Practices. Sage Publications, London.
  47. [48] Siobhan Mackenzie Hall, Samantha Dalal, Raesetje Sefala, Foutse Yuehgoh, Aisha Alaagib, Imane Hamzaoui, Shu Ishida, Jabez Magomere, Lauren Crais, Aya Salama, et al. 2025. The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes. arXiv preprint arXiv:2502.05961 (2025).
  48. [49] Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal Lalani, Evan Hadfield, Divya Siddarth, Kalika Bali, and Sunayana Sitaram. 2025. Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings. https://arxiv.org/abs/2509.24506
  49. [50] Hamna, Deepthi Sudharsan, Agrima Seth, Ritvik Budhiraja, Deepika Khullar, Vyshak Jain, Kalika Bali, Aditya Vashistha, and Sameer Segal. 2025. Kahani: Culturally-Nuanced Visual Storytelling Tool for Non-Western Cultures. In Proceedings of the 2025 ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS '25). Association for Computing Ma...
  50. [51] Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, and Hanna Wallach. 2025. Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems. https://arxiv.org/abs/2506.04482
  51. [52] Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. 2024. LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.)...
  52. [53] Huiguo He, Huan Yang, Zixi Tuo, Yuan Zhou, Qiuyue Wang, Yuhang Zhang, Zeyu Liu, Wenhao Huang, Hongyang Chao, and Jian Yin. 2025. Dreamstory: Open-domain story visualization by LLM-guided multi-subject consistent diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
  53. [54] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2018. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. https://arxiv.org/abs/1706.08500
  54. [55] Rachel Hong, William Agnew, Tadayoshi Kohno, and Jamie Morgenstern. 2024. Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp. In Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. 1–17.
  55. [56] Chien-Chi Hsu and Brian A. Sandford. 2007. The Delphi technique: Making sense of consensus. Practical Assessment, Research, and Evaluation 12, 10 (2007), 1–8. https://openpublishing.library.umass.edu/pare/article/id/1418/
  56. [57] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. 2023. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. https://arxiv.org/abs/2303.11897
  57. [58] Mina Huh, Yi-Hao Peng, and Amy Pavel. 2023. GenAssist: Making image generation accessible. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–17.
  58. [59] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. 2024. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. https://arxiv.org/abs/2401.09603
  59. [60] Akshita Jha, Vinodkumar Prabhakaran, Remi Denton, Sarah Laszlo, Shachi Dave, Rida Qadri, Chandan K Reddy, and Sunipa Dev. 2024. Visage: A global-scale analysis of visual stereotypes in text-to-image generation. arXiv preprint arXiv:2401.06310 (2024).
  60. [61] Harry H. Jiang, Lauren Brown, Jessica Cheng, Mehtab Khan, Abhishek Gupta, Deja Workman, Alex Hanna, Johnathan Flowers, and Timnit Gebru.
  61. [62] AI Art and its Impact on Artists. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (Montréal, QC, Canada) (AIES '23). Association for Computing Machinery, New York, NY, USA, 363–374. doi:10.1145/3600211.3604681.
  62. [63] Nari Johnson, Hamna Abid, Deepthi Sudharsan, Theo Holroyd, Samantha Dalal, Siobhan Mackenzie Hall, Jennifer Wortman Vaughan, Daniela Massiceti, and Cecily Morrison. 2025. Position: To Make Text-to-Image Models that Work for Marginalized Communities, We Need New Measurement Practices for the Long Tail. https://www.microsoft.com/en-us/research/publication/p...
  63. [64] Shivani Kapania, Stephanie Ballard, Alex Kessler, and Jennifer Wortman Vaughan. 2025. Examining the Expanding Role of Synthetic Data Throughout the AI Development Pipeline. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency.
  64. [65] Anna Kawakami, Su Lin Blodgett, Solon Barocas, Alex Chouldechova, Abigail Jacobs, Emily Sheng, Jenn Wortman Vaughan, Hanna Wallach, Amy Winecoff, Angelina Wang, Haiyi Zhu, and Ken Holstein. 2025. Translation Tutorial: AI Measurement as a Stakeholder-Engaged Design Practice. Retrieved January 10, 2026 from https://drive.google.com/file/d/12qQd6ROfacYAtoQ-ii...
  65. [66] Anna Kawakami, Jordan Taylor, Sarah Fox, Haiyi Zhu, and Kenneth Holstein. 2026. AI failure loops in devalued work: The confluence of overconfidence in AI and underconfidence in worker expertise. Big Data & Society 13, 1 (2026), 20539517261424164. doi:10.1177/20539517261424164.
  66. [67] Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, et al. 2024. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models. arXiv preprin...
  67. [68] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. 2023. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. https://arxiv.org/abs/2305.01569
  68. [69] Kevin Knight and Jonathan Graehl. 1998. Machine Transliteration. Computational Linguistics 24, 4 (1998), 599–612. https://aclanthology.org/J98-4003/
  69. [70] Elisa Kreiss, Cynthia Bennett, Shayan Hooshmand, Eric Zelikman, Meredith Ringel Morris, and Christopher Potts. 2022. Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics. arXiv preprint arXiv:2205.10646 (2022).
  70. [71] Neha Kumar, Naveena Karusala, Azra Ismail, Marisol Wong-Villacres, and Aditya Vishwanath. 2019. Engaging Feminist Solidarity for Comparative Research, Design, and Practice. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 167 (Nov. 2019), 24 pages. doi:10.1145/3359269.
  71. [72] Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages. https://arxiv.org/abs/2005.00085
  72. [73] Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Benita Teufel, Marco Bellagente, Minguk Kang, Taesung Park, Jure Leskovec, Jun-Yan Zhu, Li Fei-Fei, Jiajun Wu, Stefano Ermon, and Percy Liang. 2023. Holistic Evaluation of Text-To-Image Models. https://arxiv.org/abs/2311.04287
  73. [74] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, and Huan Liu. 2025. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. https://arxiv.org/abs/2411.16594
  74. [75] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. 2024. Evaluating Text-to-Visual Generation with Image-to-Text Generation. https://arxiv.org/abs/2404.01291
  75. [76] Kelly Mack, Rai Ching Ling Hsu, Andrés Monroy-Hernández, Brian A. Smith, and Fannie Liu. 2023. Towards Inclusive Avatars: Disability Representation in Avatar Platforms. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 607, 13 pages. do...
  76. [77] Kelly Avery Mack, Rida Qadri, Remi Denton, Shaun K Kane, and Cynthia L Bennett. 2024. "They only care to show us the wheelchair": disability representation in text-to-image AI models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–23.
  77. [78] Jabez Magomere, Shu Ishida, Tejumade Afonja, Aya Salama, Daniel Kochin, Yuehgoh Foutse, Imane Hamzaoui, Raesetje Sefala, Aisha Alaagib, Samantha Dalal, et al. 2025. The World Wide recipe: A community-centred framework for fine-grained data collection and regional bias operationalisation. In Proceedings of the 2025 ACM Conference on Fairness, Accountabilit...
  78. [79] Daniela Massiceti, Camilla Longden, Agnieszka Slowik, Samuel Wills, Martin Grayson, and Cecily Morrison. 2024. Explaining CLIP's performance disparities on data from blind/low vision users. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12172–12182.
  79. [80] J. Nathan Matias and Megan Price. 2025. How public involvement can improve the science of AI. Proceedings of the National Academy of Sciences 122, 48 (2025), e2421111122. doi:10.1073/pnas.2421111122.
  80. [81] Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. 2025. Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence. IEEE Transactions on Artificial Intelligence (2025), 1–18. doi:10.1109/tai.2025.3569516.

Showing first 80 references.