pith. machine review for the scientific record.

arxiv: 2604.16403 · v1 · submitted 2026-03-31 · 💻 cs.AI · cs.CY

Recognition: 2 theorem links · Lean Theorem

Computational Hermeneutics: Evaluating generative AI as a cultural technology

Adam Sobey, Aditya Vashistha, Ashley Noel-Hirst, Baptiste Caramiaux, Cody Kommers, Dalaki Livingston, Daniela Mihai, Deven Parker, Drew Hemment, Edgar Duéñez-Guzmán, Emily Robinson, Emmanouil Benetos, Evelyn Gius, Georgia Meyer, Hoyt Long, James Dobson, Jessica Ratcliff, Jonathan W. Y. Gray, Karina Rodriguez, Kerry Francksen, Kirsten Ostherr, Maria Antoniak, Martin Disley, Matthew Wilkens, Mercedes Bunz, Meredith Martin, Richard Jean So, Ruth Ahnert, Ryan Heuser, Sang Leigh, Sarah Immel, Shauna Concannon, Steve Benford, Ted Underwood, Yali Du, Yipeng Qin, Youyou Wu, Yuan Zheng

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:44 UTC · model grok-4.3

classification 💻 cs.AI cs.CY
keywords generative AI · hermeneutics · cultural technology · evaluation frameworks · interpretive challenges · context machines · situatedness · plurality

The pith

Generative AI systems function as context machines that address interpretive challenges of situatedness, plurality, and ambiguity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative AI is viewed here not merely as a technical system but as a cultural technology whose outputs gain meaning only through context. The authors draw on hermeneutic theory to claim that these systems must inherently manage three challenges: meanings that depend on their situation, the coexistence of multiple valid readings, and natural conflicts between interpretations. They introduce computational hermeneutics as a framework to interpret what these systems do and to improve their operation. The paper proposes three evaluation principles: making benchmarks iterative, involving human participants, and assessing cultural context rather than isolated outputs. This approach would redirect AI design and assessment from measuring accuracy on fixed questions toward understanding contextual meaning.

Core claim

We argue that GenAI systems function as context machines that must inherently address three interpretive challenges: situatedness, where meaning only emerges in context; plurality, where multiple valid interpretations coexist; and ambiguity, where interpretations naturally conflict. We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation: that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.

What carries the argument

Computational hermeneutics as a framework that treats GenAI systems as context machines required to handle situatedness, plurality, and ambiguity in meaning.

If this is right

  • Evaluation benchmarks must be iterative processes rather than single fixed tests.
  • Assessment requires direct inclusion of human participants in addition to automated measures.
  • Metrics need to capture cultural context and interpretive fit instead of isolated output accuracy.
  • System design should prioritize addressing interpretive challenges over optimizing for standardized questions.
  • The overall paradigm for AI evaluation shifts from accuracy to contextual meaning.
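Taken together, these implications describe an evaluation loop rather than a fixed test. A minimal sketch of what an iterative, people-inclusive benchmark could look like (all names here, `Item`, `record_round`, `contextual_score`, are hypothetical stand-ins for illustration, not an interface the paper proposes):

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Item:
    prompt: str
    # Multiple human readings per round, not one fixed gold label (plurality).
    judgments: dict[str, list[int]] = field(default_factory=dict)

def record_round(item: Item, round_name: str, scores: list[int]) -> None:
    """One iteration: collect fresh human ratings instead of reusing a frozen answer key."""
    item.judgments[round_name] = scores

def contextual_score(item: Item) -> float:
    """Average rating across all rounds and raters: a crude proxy for interpretive fit."""
    all_scores = [s for scores in item.judgments.values() for s in scores]
    return mean(all_scores) if all_scores else 0.0

item = Item(prompt="Explain this proverb for a Nairobi audience")
record_round(item, "round-1", [3, 4, 2])  # diverse raters, first pass
record_round(item, "round-2", [4, 4, 3])  # benchmark revised, second pass
print(round(contextual_score(item), 2))   # → 3.33
```

The point of the sketch is structural: the benchmark is a growing record of situated judgments, so re-running it after revising prompts or rater pools is the normal case, not a protocol violation.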

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This view could guide training data curation to explicitly embed varied cultural contexts for better ambiguity handling.
  • It suggests potential links to existing work on human-AI collaboration in interpretive fields like history or literature.
  • Developers might test the framework by applying it to specific domains such as creative writing or historical analysis tasks.
  • Over time it could influence how regulators assess AI systems deployed in cultural or media production.

Load-bearing premise

Hermeneutic theory from the humanities can be straightforwardly applied to provide a computational account of GenAI operation and evaluation without requiring additional empirical validation or adaptation.

What would settle it

A controlled study comparing GenAI performance on cultural tasks using hermeneutic evaluation principles versus standard accuracy metrics, where the hermeneutic approach shows no measurable improvement in handling context or ambiguity.

read the original abstract

Generative AI systems are increasingly recognized as cultural technologies, yet current evaluation frameworks often treat culture as a variable to be measured rather than fundamental to the system's operation. Drawing on hermeneutic theory from the humanities, we argue that GenAI systems function as "context machines" that must inherently address three interpretive challenges: situatedness (meaning only emerges in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations naturally conflict). We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation -- that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that generative AI systems function as 'context machines' that inherently address three interpretive challenges drawn from hermeneutic theory—situatedness (meaning emerges only in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations conflict)—and proposes 'computational hermeneutics' as an emerging framework for an interpretive account of GenAI operation and evaluation, along with three principles: benchmarks should be iterative rather than one-off, include people not just machines, and measure cultural context not just model output.

Significance. If the interpretive account holds, the paper offers a potentially significant shift in how GenAI is conceptualized and evaluated, moving from standardized accuracy metrics toward contextual questions about meaning and culture; this could influence design paradigms if the framework is later operationalized, and the explicit integration of humanities-derived hermeneutic theory into AI assessment is a clear strength of the conceptual contribution.

major comments (2)
  1. [Introduction and framework presentation] The section introducing the three interpretive challenges and the 'context machines' framing: the manuscript asserts that GenAI systems inherently address situatedness, plurality, and ambiguity but supplies no derivation or technical mapping showing why standard next-token prediction fails to capture them or how these challenges translate into model architectures, loss functions, or metrics distinct from existing context-window or retrieval-augmented methods.
  2. [Principles for hermeneutic evaluation] The section offering the three principles for hermeneutic evaluation: the principles (iterative benchmarks, people-inclusive, context-measuring) are stated at a high level without operationalization of 'cultural context' as a measurable quantity or any concrete examples of how they would alter benchmark design or model training, rendering the framework non-computable in its current form.
minor comments (1)
  1. The manuscript would benefit from additional citations to specific hermeneutic theorists (e.g., Gadamer or Ricoeur) to make the theoretical grounding more traceable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of integrating hermeneutic theory into GenAI evaluation. We agree that the manuscript would benefit from greater clarity on the conceptual-to-technical linkages and from more concrete illustrations of the proposed principles. Below we respond point by point and indicate the revisions we will make.

read point-by-point responses
  1. Referee: The section introducing the three interpretive challenges and the 'context machines' framing: the manuscript asserts that GenAI systems inherently address situatedness, plurality, and ambiguity but supplies no derivation or technical mapping showing why standard next-token prediction fails to capture them or how these challenges translate into model architectures, loss functions, or metrics distinct from existing context-window or retrieval-augmented methods.

    Authors: We acknowledge that the current presentation remains at the level of interpretive framing rather than supplying explicit technical derivations. The manuscript positions computational hermeneutics as a conceptual lens rather than an immediate architectural proposal; therefore no detailed mapping to loss functions or novel metrics was included. To address the concern, we will revise the introduction to add a short subsection that (a) contrasts next-token prediction with the three challenges by reference to known limitations of fixed context windows, (b) illustrates how retrieval-augmented generation partially addresses plurality but leaves situatedness and ambiguity under-specified, and (c) sketches, at a conceptual level, how an iterative human-in-the-loop protocol could surface distinct evaluation signals. These additions will clarify the intended relationship without claiming new technical results. revision: partial

  2. Referee: The section offering the three principles for hermeneutic evaluation: the principles (iterative benchmarks, people-inclusive, context-measuring) are stated at a high level without operationalization of 'cultural context' as a measurable quantity or any concrete examples of how they would alter benchmark design or model training, rendering the framework non-computable in its current form.

    Authors: We accept that the principles are currently stated at a programmatic level and that operational definitions and examples are required to demonstrate feasibility. In the revised manuscript we will (1) define 'cultural context' operationally via two proxy measures—inter-annotator agreement on culturally specific references and the number of distinct valid interpretations elicited from diverse human evaluators—and (2) supply two worked examples: an adaptation of a standard multiple-choice benchmark that inserts iterative human clarification rounds, and a training-time objective that augments cross-entropy loss with an ambiguity-resolution term derived from multi-annotator disagreement. These changes will render the framework more actionable while preserving its theoretical grounding. revision: yes
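The "ambiguity-resolution term" promised above is named but not specified in the rebuttal. One plausible reading, offered purely as an illustration, treats the annotator vote distribution as a soft target and adds a weighted soft cross-entropy on top of the usual hard-label loss (the function names and the weight `lam` are invented for this sketch):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def annotator_distribution(votes: list[int], n_classes: int) -> list[float]:
    """Turn raw multi-annotator votes into a soft target distribution."""
    counts = [0] * n_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]

def ambiguity_aware_loss(logits: list[float], votes: list[int], lam: float = 0.5) -> float:
    """Cross-entropy to the majority label, plus a weighted cross-entropy
    to the full annotator distribution (the hypothetical 'ambiguity' term)."""
    p = softmax(logits)
    target = annotator_distribution(votes, len(logits))
    majority = max(range(len(target)), key=target.__getitem__)
    ce_hard = -math.log(p[majority])
    ce_soft = -sum(t * math.log(q) for t, q in zip(target, p) if t > 0)
    return ce_hard + lam * ce_soft
```

Under this reading, unanimous annotators make the soft term collapse onto the hard term, while disagreement inflates the loss whenever the model is overconfident in a single interpretation, which is one way to make "ambiguity" trainable rather than merely rhetorical.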

Circularity Check

0 steps flagged

No significant circularity: interpretive framework draws on external hermeneutic theory

full rationale

The manuscript proposes computational hermeneutics as an interpretive lens for GenAI, framing systems as context machines that address situatedness, plurality, and ambiguity, then suggests three evaluation principles (iterative benchmarks, people-inclusive, context-measuring). These claims rest on direct citation of established humanities hermeneutic theory rather than any internal derivation, equations, fitted parameters, or self-referential definitions. No load-bearing step reduces a result to its own inputs by construction, no predictions are statistically forced from subsets of data, and no uniqueness theorems or ansatzes are smuggled via self-citation. The argument is self-contained as a conceptual extension of external sources, with no mathematical or empirical loop that would trigger circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The claim rests on the applicability of hermeneutic theory to AI without new supporting evidence; no free parameters are fitted, but the framework introduces a new interpretive lens.

axioms (2)
  • domain assumption Hermeneutic theory supplies a valid interpretive account for how GenAI systems function as context machines
    Invoked in the abstract as the basis for identifying the three interpretive challenges.
  • domain assumption Current evaluation frameworks treat culture as a variable to be measured rather than fundamental to system operation
    Stated directly as the motivation for the new framework.
invented entities (1)
  • computational hermeneutics no independent evidence
    purpose: Emerging framework for interpretive evaluation of GenAI
    New term coined to describe the proposed approach combining hermeneutics with computational systems.

pith-pipeline@v0.9.0 · 5601 in / 1291 out tokens · 50942 ms · 2026-05-13T23:44:28.172838+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages

  1. [1]

    Century: A framework and dataset for evaluating historical contextualisation of sensitive images

    Canfer Akbulut, Kevin Robinson, Maribeth Rauh, Isabela Albuquerque, Olivia Wiles, Laura Weidinger, Verena Rieser, Yana Hasson, Nahema Marchal, Iason Gabriel, et al. Century: A framework and dataset for evaluating historical contextualisation of sensitive images. InThe Thirteenth International Conference on Learning Representations, 2025

  2. [2]

    All too human? mapping and mitigating the risk from anthropomorphic ai

    Canfer Akbulut, Laura Weidinger, Arianna Manzini, Iason Gabriel, and Verena Rieser. All too human? mapping and mitigating the risk from anthropomorphic ai. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 13–26, 2024

  3. [3]

    When benchmarks are targets: Revealing the sensitivity of large language model leaderboards

    Norah Alzahrani, Hisham Alyahya, Yazeed Alnumay, Sultan AlRashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M Saiful Bari, and Haidar Khan. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. In Proceedings of the 62nd Annual Meeting of the Association ...

  4. [4]

    A scenario-based design pack for exploring multimodal human–genai relations

    Josh Andres, Chris Danta, Andrea Bianchi, Sahar Farzanfar, Gloria Milena Fernandez-Nieto, Alexa Becker, Tara Capel, Frances Liddell, Shelby Hagemann, Ned Cooper, et al. A scenario-based design pack for exploring multimodal human–genai relations. InProceedings of the 27th International Conference on Multimodal Interaction, pages 145–154, 2025

  5. [5]

    Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

  6. [6]

    On the dangers of stochastic parrots: Can language models be too big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY , USA, 2021. Association for Computing Machinery

  7. [7]

    Unsupervised feature learning and deep learning: A review and new perspectives

    Yoshua Bengio, Aaron C Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives.CoRR, abs/1206.5538, 1(2665):2012, 2012

  8. [8]

    Seegull multilingual: a dataset of geo-culturally situated stereotypes

    Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, and Sunipa Dev. Seegull multilingual: a dataset of geo-culturally situated stereotypes. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 842–854, 2024

  9. [9]

    Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals

    Kathrin Blagec, Jakob Kraiger, Wolfgang Frühwirt, and Matthias Samwald. Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals.Journal of Biomedical Informatics, 137:104274, 2023

  10. [10]

    Automatic extraction of metaphoric analogies from literary texts: Task formulation, dataset construction, and evaluation

    Joanne Boisson, Zara Siddique, Hsuvas Borkakoty, Dimosthenis Antypas, Luis Espinosa Anke, and Jose Camacho-Collados. Automatic extraction of metaphoric analogies from literary texts: Task formulation, dataset construction, and evaluation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6692–6704, 2025

  11. [11]

    Man is to computer programmer as woman is to homemaker? Debiasing word embeddings

    Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.Advances in Neural Information Processing Systems, 29, 2016

  12. [12]

    Machine culture

    Levin Brinkmann, Fabian Baumann, Jean-François Bonnefon, Maxime Derex, Thomas F Müller, Anne-Marie Nussberger, Agnieszka Czaplicka, Alberto Acerbi, Thomas L Griffiths, Joseph Henrich, et al. Machine culture. Nature Human Behaviour, 7(11):1855–1868, 2023

  13. [13]

    Rethink reporting of evaluation results in AI

    Ryan Burnell, Wout Schellaert, John Burden, Tomer D Ullman, Fernando Martinez-Plumed, Joshua B Tenenbaum, Danaja Rutar, Lucy G Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, et al. Rethink reporting of evaluation results in AI.Science, 380(6641):136–138, 2023

  14. [14]

    Assessing cross- cultural alignment between ChatGPT and human societies: An empirical study

    Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. Assessing cross- cultural alignment between ChatGPT and human societies: An empirical study. InProceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 53–67, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics

  15. [15]

    Hermeneutics: Facts and interpretation in the age of information

    John D Caputo.Hermeneutics: Facts and interpretation in the age of information. Penguin UK, 2018

  16. [16]

    Explorers of unknown planets

    Baptiste Caramiaux and Sarah Fdili Alaoui. “Explorers of unknown planets”: Practices and politics of artificial intelligence in visual arts. Proc. ACM Hum.-Comput. Interact., 6(CSCW2), November 2022

  17. [17]

    Art or artifice? Large language models and the false promise of creativity

    Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? Large language models and the false promise of creativity. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery

  18. [18]

    A survey on evaluation of large language models

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models.ACM Trans. Intell. Syst. Technol., 15(3), March 2024

  19. [19]

    Unleashing the potential of prompt engineering for large language models

    Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. Unleashing the potential of prompt engineering for large language models.Patterns

  20. [20]

    A computational framework for behavioral assessment of LLM therapists

    Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, and Tim Althoff. A computational framework for behavioral assessment of LLM therapists.arXiv preprint arXiv:2401.00820, 2024

  21. [21]

    Building machines that learn and think with people

    Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, et al. Building machines that learn and think with people.Nature Human Behaviour, 8(10):1851–1863, 2024

  22. [22]

    From driverless dilemmas to more practical commonsense tests for automated vehicles

    Julian De Freitas, Andrea Censi, Bryant Walker Smith, Luigi Di Lillo, Sam E Anthony, and Emilio Frazzoli. From driverless dilemmas to more practical commonsense tests for automated vehicles. Proceedings of the National Academy of Sciences, 118(11):e2010202118, 2021

  23. [23]

    Bringing the people back in: Contesting benchmark machine learning datasets

    Remi Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, and Morgan Klaus Scheuerman. Bringing the people back in: Contesting benchmark machine learning datasets.arXiv preprint arXiv:2007.07399, 2020

  24. [24]

    An archival perspective on pretraining data

    Meera A Desai, Irene V Pasquetto, Abigail Z Jacobs, and Dallas Card. An archival perspective on pretraining data.Patterns, 5(4), 2024

  25. [25]

    A taxonomy of linguistic expressions that contribute to anthropomorphism of language technologies

    Alicia DeVrio, Myra Cheng, Lisa Egede, Alexandra Olteanu, and Su Lin Blodgett. A taxonomy of linguistic expressions that contribute to anthropomorphism of language technologies. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2025

  26. [26]

    Introduction to the human sciences

    Wilhelm Dilthey.Introduction to the human sciences, volume 1. Princeton University Press, 1989

  27. [27]

    Critical digital humanities: The search for a methodology

    James E Dobson.Critical digital humanities: The search for a methodology. University of Illinois Press, 2019

  28. [28]

    Vector hermeneutics: On the interpretation of vector space models of text

    James E Dobson. Vector hermeneutics: On the interpretation of vector space models of text.Digital Scholarship in the Humanities, 37(1):81–93, 2022

  29. [29]

    Towards a rigorous science of interpretable machine learning

    Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning.stat, 1050:2, 2017

  30. [30]

    Relational norms for human-AI cooperation

    Brian D Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, et al. Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025

  31. [31]

    Seven Types of Ambiguity

    William Empson. Seven Types of Ambiguity. 1930

  32. [32]

    Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation

    Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation.arXiv preprint arXiv:2502.06559, 2025

  33. [33]

    How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings

    Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China, November 2019...

  34. [34]

    Utility is in the eye of the user: A critique of NLP leaderboards

    Kawin Ethayarajh and Dan Jurafsky. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, Online, November 2020. Association for Computational Linguistics

  35. [35]

    Large AI models are cultural and social technologies

    Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans. Large AI models are cultural and social technologies. Science, 387(6739):1153–1156, 2025

  36. [36]

    Entanglement HCI the next wave?

    Christopher Frauenberger. Entanglement HCI the next wave?ACM Trans. Comput.-Hum. Interact., 27(1), November 2019

  37. [37]

    Perspectivist approaches to natural language processing: A survey

    Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, and Davide Bernardi. Perspectivist approaches to natural language processing: A survey. Language Resources and Evaluation, pages 1–28, 2024

  38. [38]

    Truth and method

    Hans-Georg Gadamer. Truth and method. 1960

  39. [39]

    Ambiguity as a resource for design

    William W. Gaver, Jacob Beaver, and Steve Benford. Ambiguity as a resource for design. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, page 233–240, New York, NY , USA, 2003. Association for Computing Machinery

  40. [40]

    How culture shapes what people want from AI

    Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. How culture shapes what people want from AI. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery

  41. [41]

    The interpretation of cultures

    Clifford Geertz.The interpretation of cultures. Basic Books, 1973

  42. [42]

    Representation: Cultural representations and signifying practices

    S Hall. Representation: Cultural representations and signifying practices.Culture, 1997

  43. [43]

    Situated knowledges: The science question in feminism and the privilege of partial perspective

    Donna Haraway. Situated knowledges: The science question in feminism and the privilege of partial perspective.Feminist Studies, 14(3):575–599, 1988

  44. [44]

    Being and time

    Martin Heidegger. Being and time. 1927

  45. [45]

    Doing AI differently: Rethinking the foundations of AI via the humanities

    Drew Hemment, Cody Kommers, and colleagues. Doing AI differently: Rethinking the foundations of AI via the humanities. Technical report, London: The Alan Turing Institute, 2025

  46. [46]

    Experiential AI: Between arts and explainable AI

    Drew Hemment, Dave Murray-Rust, Vaishak Belle, Ruth Aylett, Matjaz Vidmar, and Frank Broz. Experiential AI: Between arts and explainable AI.Leonardo, 57(3):298–306, 2024

  47. [47]

    Cultural collapse: Toward a generative formalism for ai cultural production

    Ryan Heuser. Cultural collapse: Toward a generative formalism for ai cultural production.Anthology of Computers and the Humanities, 3:575–588, 2025

  48. [48]

    Multi-turn evaluation of anthropomorphic behaviours in large language models

    Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R McKee, Verena Rieser, Murray Shanahan, and Laura Weidinger. Multi-turn evaluation of anthropomorphic behaviours in large language models.arXiv preprint arXiv:2502.07077, 2025

  49. [49]

    Towards interactive evaluations for interaction harms in human-ai systems

    Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. Towards interactive evaluations for interaction harms in human-ai systems. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1302–1310, 2025

  50. [50]

    An introduction to statistical learning: with applications in R

    Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.An introduction to statistical learning: with applications in R, volume 103. Springer, 2013

  51. [51]

    Dead rats, dopamine, performance metrics, and peacock tails: Proxy failure is an inherent risk in goal-oriented systems

    Yohan J John, Leigh Caldwell, Dakota E McCoy, and Oliver Braganza. Dead rats, dopamine, performance metrics, and peacock tails: Proxy failure is an inherent risk in goal-oriented systems.Behavioral and Brain Sciences, 47:e67, 2024

  52. [52]

    AI agents that matter

    Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter.arXiv preprint arXiv:2407.01502, 2024

  53. [53]

    Provocations from the humanities for generative AI research

    Lauren Klein, Meredith Martin, André Brock, Maria Antoniak, Melanie Walsh, Jessica Marie Johnson, Lauren Tilton, and David Mimno. Provocations from the humanities for generative AI research.arXiv preprint arXiv:2502.19190, 2025

  54. [54]

    From protoscience to epistemic monoculture: How benchmarking set the stage for the deep learning revolution

    Bernard J Koch and David Peterson. From protoscience to epistemic monoculture: How benchmarking set the stage for the deep learning revolution.arXiv preprint arXiv:2404.06647, 2024

  55. [55]

    Sense-making, cultural scripts, and the inferential basis of meaningful experience

    Cody Kommers and Simon DeDeo. Sense-making, cultural scripts, and the inferential basis of meaningful experience. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025

  56. [56]

    Why slop matters

    Cody Kommers, Eamon Duede, Julia Gordon, Ari Holtzman, Tess McNulty, Spencer Stewart, Lindsay Thomas, Richard Jean So, and Hoyt Long. Why slop matters.ACM AI Letters, 2025

  57. [57]

    Meaning is not a metric: Using LLMs to make cultural context legible at scale

    Cody Kommers, Drew Hemment, Maria Antoniak, Joel Z Leibo, Hoyt Long, Emily Robinson, and Adam Sobey. Meaning is not a metric: Using LLMs to make cultural context legible at scale. arXiv preprint arXiv:2505.23785, 2025

  58. [58]

    The geometry of culture: Analyzing the meanings of class through word embeddings

    Austin C Kozlowski, Matt Taddy, and James A Evans. The geometry of culture: Analyzing the meanings of class through word embeddings.American Sociological Review, 84(5):905–949, 2019

  59. [59]

    AI safety on whose terms?

    Seth Lazar and Alondra Nelson. Ai safety on whose terms?Science, 381(6654):138–138, 2023

  60. [60]

    Joel Z Leibo, Alexander Sasha Vezhnevets, Manfred Diaz, John P Agapiou, William A Cunningham, Peter Sunehag, Julia Haas, Raphael Koster, Edgar A Duéñez-Guzmán, William S Isaac, et al. A theory of appropriateness with applications to generative artificial intelligence. arXiv preprint arXiv:2412.19010, 2024

  61. [61]

    Sanford Levinson and Steven Mailloux. Interpreting law and literature: A hermeneutic reader. Northwestern University Press, 1988

  62. [62]

    Ming Li, Jiuhai Chen, Lichang Chen, and Tianyi Zhou. Can LLMs speak for diverse people? Tuning LLMs via debate to generate controllable controversial statements. arXiv preprint arXiv:2402.10614, 2024

  63. [63]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2022

  64. [64]

    Q Vera Liao and Ziang Xiao. Rethinking model evaluation as narrowing the socio-technical gap. arXiv preprint arXiv:2306.03100, 2023

  65. [65]

    Ryan Lowe, Joe Edelman, Tan Zhi-Xuan, Oliver Klingefjord, Ellie Hain, Vincent Wang, Atrisha Sarkar, Michiel A Bakker, Fazl Barez, Matija Franklin, et al. Full-stack alignment: Co-aligning AI and institutions with thicker models of value. In 2nd Workshop on Models of Human Feedback for AI Alignment, 2025

  66. [66]

    Chaitanya Malaviya, Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, and Kyle Lo. Contextualized evaluations: Judging language model responses to underspecified queries. Transactions of the Association for Computational Linguistics, 13:878–900, 2025

  67. [67]

    Guillermo Marco, Julio Gonzalo, and Víctor Fresno. The reader is the metric: How textual features and reader profiles explain conflicting evaluations of AI creative writing. arXiv preprint arXiv:2506.03310, 2025

  68. [68]

    Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. IEEE Transactions on Artificial Intelligence, 2025

  69. [69]

    Lisa Messeri and Molly J Crockett. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49–58, 2024

  70. [70]

    Daniela Mihai and Jonathon Hare. Learning to draw: Emergent communication through sketching. Advances in Neural Information Processing Systems, 34:7153–7166, 2021

  71. [71]

    Rada Mihalcea, Oana Ignat, Longju Bai, Angana Borah, Luis Chiruzzo, Zhijing Jin, Claude Kwizera, Joan Nwatu, Soujanya Poria, and Thamar Solorio. Why AI is weird and shouldn't be this way: Towards AI for everyone, with everyone, by everyone. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28657–28670, 2025

  72. [72]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 2013

  73. [73]

    Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? A call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12:933–949, 2024

  74. [74]

    John W Mohr, Robin Wagner-Pacifici, and Ronald L Breiger. Toward a computational hermeneutics. Big Data & Society, 2(2):2053951715613809, 2015

  75. [75]

    Tim Murray-Browne and Panagiotis Tigas. Emergent interfaces: Vague, complex, bespoke and embodied interaction between humans and computers. Applied Sciences, 11(18):8531, 2021

  76. [76]

    Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):1–69, 2009

  77. [77]

    Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd Van Steenkiste, Yash Goyal, Karolina Stańczak, and Aishwarya Agrawal. CulturalFrames: Assessing cultural expectation alignment in text-to-image models and evaluation metrics. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20918–20953, 2025

  78. [78]

    Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, 2022

  79. [79]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  80. [80]

    Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics

Showing first 80 references.