pith. machine review for the scientific record.

arxiv: 2604.16403 · v1 · submitted 2026-03-31 · 💻 cs.AI · cs.CY

Recognition: 2 theorem links · Lean Theorem

Computational Hermeneutics: Evaluating generative AI as a cultural technology

Adam Sobey, Aditya Vashistha, Ashley Noel-Hirst, Baptiste Caramiaux, Cody Kommers, Dalaki Livingston, Daniela Mihai, Deven Parker, Drew Hemment, Edgar Duéñez-Guzmán, Emily Robinson, Emmanouil Benetos, Evelyn Gius, Georgia Meyer, Hoyt Long, James Dobson, Jessica Ratcliff, Jonathan W. Y. Gray, Karina Rodriguez, Kerry Francksen, Kirsten Ostherr, Maria Antoniak, Martin Disley, Matthew Wilkens, Mercedes Bunz, Meredith Martin, Richard Jean So, Ruth Ahnert, Ryan Heuser, Sang Leigh, Sarah Immel, Shauna Concannon, Steve Benford, Ted Underwood, Yali Du, Yipeng Qin, Youyou Wu, Yuan Zheng

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:44 UTC · model grok-4.3

classification 💻 cs.AI cs.CY
keywords generative AI · hermeneutics · cultural technology · evaluation frameworks · interpretive challenges · context machines · situatedness · plurality

The pith

Generative AI systems function as context machines that address interpretive challenges of situatedness, plurality, and ambiguity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative AI is viewed here not merely as a technical system but as a cultural technology whose outputs gain meaning only through context. The authors draw on hermeneutic theory to claim that these systems must inherently manage three challenges: meanings that depend on their situation, the coexistence of multiple valid readings, and natural conflicts between interpretations. They introduce computational hermeneutics as a framework to interpret what these systems do and to improve their operation. The paper proposes three evaluation principles: making benchmarks iterative, involving human participants, and assessing cultural context rather than isolated outputs. This approach would redirect AI design and assessment from measuring accuracy on fixed questions toward understanding contextual meaning.

Core claim

We argue that GenAI systems function as context machines that must inherently address three interpretive challenges: situatedness, where meaning only emerges in context; plurality, where multiple valid interpretations coexist; and ambiguity, where interpretations naturally conflict. We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation: that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.

What carries the argument

Computational hermeneutics as a framework that treats GenAI systems as context machines required to handle situatedness, plurality, and ambiguity in meaning.

If this is right

  • Evaluation benchmarks must be iterative processes rather than single fixed tests.
  • Assessment requires direct inclusion of human participants in addition to automated measures.
  • Metrics need to capture cultural context and interpretive fit instead of isolated output accuracy.
  • System design should prioritize addressing interpretive challenges over optimizing for standardized questions.
  • The overall paradigm for AI evaluation shifts from accuracy to contextual meaning.
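Taken together, these implications describe an evaluation loop rather than a fixed test. A minimal sketch of what an iterative, people-inclusive benchmark could look like (all names here, `Item`, `record_round`, `contextual_score`, are hypothetical stand-ins for illustration, not an interface the paper proposes):

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Item:
    prompt: str
    # Multiple human readings per round, not one fixed gold label (plurality).
    judgments: dict[str, list[int]] = field(default_factory=dict)

def record_round(item: Item, round_name: str, scores: list[int]) -> None:
    """One iteration: collect fresh human ratings instead of reusing a frozen answer key."""
    item.judgments[round_name] = scores

def contextual_score(item: Item) -> float:
    """Average rating across all rounds and raters: a crude proxy for interpretive fit."""
    all_scores = [s for scores in item.judgments.values() for s in scores]
    return mean(all_scores) if all_scores else 0.0

item = Item(prompt="Explain this proverb for a Nairobi audience")
record_round(item, "round-1", [3, 4, 2])  # diverse raters, first pass
record_round(item, "round-2", [4, 4, 3])  # benchmark revised, second pass
print(round(contextual_score(item), 2))   # → 3.33
```

The point of the sketch is structural: the benchmark is a growing record of situated judgments, so re-running it after revising prompts or rater pools is the normal case, not a protocol violation.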

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This view could guide training data curation to explicitly embed varied cultural contexts for better ambiguity handling.
  • It suggests potential links to existing work on human-AI collaboration in interpretive fields like history or literature.
  • Developers might test the framework by applying it to specific domains such as creative writing or historical analysis tasks.
  • Over time it could influence how regulators assess AI systems deployed in cultural or media production.

Load-bearing premise

Hermeneutic theory from the humanities can be straightforwardly applied to provide a computational account of GenAI operation and evaluation without requiring additional empirical validation or adaptation.

What would settle it

A controlled study comparing GenAI performance on cultural tasks using hermeneutic evaluation principles versus standard accuracy metrics, where the hermeneutic approach shows no measurable improvement in handling context or ambiguity.

read the original abstract

Generative AI systems are increasingly recognized as cultural technologies, yet current evaluation frameworks often treat culture as a variable to be measured rather than fundamental to the system's operation. Drawing on hermeneutic theory from the humanities, we argue that GenAI systems function as "context machines" that must inherently address three interpretive challenges: situatedness (meaning only emerges in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations naturally conflict). We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation -- that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that generative AI systems function as 'context machines' that inherently address three interpretive challenges drawn from hermeneutic theory—situatedness (meaning emerges only in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations conflict)—and proposes 'computational hermeneutics' as an emerging framework for an interpretive account of GenAI operation and evaluation, along with three principles: benchmarks should be iterative rather than one-off, include people not just machines, and measure cultural context not just model output.

Significance. If the interpretive account holds, the paper offers a potentially significant shift in how GenAI is conceptualized and evaluated, moving from standardized accuracy metrics toward contextual questions about meaning and culture; this could influence design paradigms if the framework is later operationalized, and the explicit integration of humanities-derived hermeneutic theory into AI assessment is a clear strength of the conceptual contribution.

major comments (2)
  1. [Introduction and framework presentation] The section introducing the three interpretive challenges and the 'context machines' framing: the manuscript asserts that GenAI systems inherently address situatedness, plurality, and ambiguity but supplies no derivation or technical mapping showing why standard next-token prediction fails to capture them or how these challenges translate into model architectures, loss functions, or metrics distinct from existing context-window or retrieval-augmented methods.
  2. [Principles for hermeneutic evaluation] The section offering the three principles for hermeneutic evaluation: the principles (iterative benchmarks, people-inclusive, context-measuring) are stated at a high level without operationalization of 'cultural context' as a measurable quantity or any concrete examples of how they would alter benchmark design or model training, rendering the framework non-computable in its current form.
minor comments (1)
  1. The manuscript would benefit from additional citations to specific hermeneutic theorists (e.g., Gadamer or Ricoeur) to make the theoretical grounding more traceable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of integrating hermeneutic theory into GenAI evaluation. We agree that the manuscript would benefit from greater clarity on the conceptual-to-technical linkages and from more concrete illustrations of the proposed principles. Below we respond point by point and indicate the revisions we will make.

read point-by-point responses
  1. Referee: The section introducing the three interpretive challenges and the 'context machines' framing: the manuscript asserts that GenAI systems inherently address situatedness, plurality, and ambiguity but supplies no derivation or technical mapping showing why standard next-token prediction fails to capture them or how these challenges translate into model architectures, loss functions, or metrics distinct from existing context-window or retrieval-augmented methods.

    Authors: We acknowledge that the current presentation remains at the level of interpretive framing rather than supplying explicit technical derivations. The manuscript positions computational hermeneutics as a conceptual lens rather than an immediate architectural proposal; therefore no detailed mapping to loss functions or novel metrics was included. To address the concern, we will revise the introduction to add a short subsection that (a) contrasts next-token prediction with the three challenges by reference to known limitations of fixed context windows, (b) illustrates how retrieval-augmented generation partially addresses plurality but leaves situatedness and ambiguity under-specified, and (c) sketches, at a conceptual level, how an iterative human-in-the-loop protocol could surface distinct evaluation signals. These additions will clarify the intended relationship without claiming new technical results. revision: partial

  2. Referee: The section offering the three principles for hermeneutic evaluation: the principles (iterative benchmarks, people-inclusive, context-measuring) are stated at a high level without operationalization of 'cultural context' as a measurable quantity or any concrete examples of how they would alter benchmark design or model training, rendering the framework non-computable in its current form.

    Authors: We accept that the principles are currently stated at a programmatic level and that operational definitions and examples are required to demonstrate feasibility. In the revised manuscript we will (1) define 'cultural context' operationally via two proxy measures—inter-annotator agreement on culturally specific references and the number of distinct valid interpretations elicited from diverse human evaluators—and (2) supply two worked examples: an adaptation of a standard multiple-choice benchmark that inserts iterative human clarification rounds, and a training-time objective that augments cross-entropy loss with an ambiguity-resolution term derived from multi-annotator disagreement. These changes will render the framework more actionable while preserving its theoretical grounding. revision: yes
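The "ambiguity-resolution term" promised above is named but not specified in the rebuttal. One plausible reading, offered purely as an illustration, treats the annotator vote distribution as a soft target and adds a weighted soft cross-entropy on top of the usual hard-label loss (the function names and the weight `lam` are invented for this sketch):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def annotator_distribution(votes: list[int], n_classes: int) -> list[float]:
    """Turn raw multi-annotator votes into a soft target distribution."""
    counts = [0] * n_classes
    for v in votes:
        counts[v] += 1
    return [c / len(votes) for c in counts]

def ambiguity_aware_loss(logits: list[float], votes: list[int], lam: float = 0.5) -> float:
    """Cross-entropy to the majority label, plus a weighted cross-entropy
    to the full annotator distribution (the hypothetical 'ambiguity' term)."""
    p = softmax(logits)
    target = annotator_distribution(votes, len(logits))
    majority = max(range(len(target)), key=target.__getitem__)
    ce_hard = -math.log(p[majority])
    ce_soft = -sum(t * math.log(q) for t, q in zip(target, p) if t > 0)
    return ce_hard + lam * ce_soft
```

Under this reading, unanimous annotators make the soft term collapse onto the hard term, while disagreement inflates the loss whenever the model is overconfident in a single interpretation, which is one way to make "ambiguity" trainable rather than merely rhetorical.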

Circularity Check

0 steps flagged

No significant circularity: interpretive framework draws on external hermeneutic theory

full rationale

The manuscript proposes computational hermeneutics as an interpretive lens for GenAI, framing systems as context machines that address situatedness, plurality, and ambiguity, then suggests three evaluation principles (iterative benchmarks, people-inclusive, context-measuring). These claims rest on direct citation of established humanities hermeneutic theory rather than any internal derivation, equations, fitted parameters, or self-referential definitions. No load-bearing step reduces a result to its own inputs by construction, no predictions are statistically forced from subsets of data, and no uniqueness theorems or ansatzes are smuggled via self-citation. The argument is self-contained as a conceptual extension of external sources, with no mathematical or empirical loop that would trigger circularity under the specified patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The claim rests on the applicability of hermeneutic theory to AI without new supporting evidence; no free parameters are fitted, but the framework introduces a new interpretive lens.

axioms (2)
  • domain assumption Hermeneutic theory supplies a valid interpretive account for how GenAI systems function as context machines
    Invoked in the abstract as the basis for identifying the three interpretive challenges.
  • domain assumption Current evaluation frameworks treat culture as a variable to be measured rather than fundamental to system operation
    Stated directly as the motivation for the new framework.
invented entities (1)
  • computational hermeneutics no independent evidence
    purpose: Emerging framework for interpretive evaluation of GenAI
    New term coined to describe the proposed approach combining hermeneutics with computational systems.

pith-pipeline@v0.9.0 · 5601 in / 1291 out tokens · 50942 ms · 2026-05-13T23:44:28.172838+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

117 extracted references · 117 canonical work pages

  1. [1]

    Century: A framework and dataset for evaluating historical contextualisation of sensitive images

    Canfer Akbulut, Kevin Robinson, Maribeth Rauh, Isabela Albuquerque, Olivia Wiles, Laura Weidinger, Verena Rieser, Yana Hasson, Nahema Marchal, Iason Gabriel, et al. Century: A framework and dataset for evaluating historical contextualisation of sensitive images. InThe Thirteenth International Conference on Learning Representations, 2025

  2. [2]

    All too human? mapping and mitigating the risk from anthropomorphic ai

    Canfer Akbulut, Laura Weidinger, Arianna Manzini, Iason Gabriel, and Verena Rieser. All too human? mapping and mitigating the risk from anthropomorphic ai. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 13–26, 2024

  3. [3]

    When benchmarks are targets: Revealing the sensitivity of large language model leaderboards

    Norah Alzahrani, Hisham Alyahya, Yazeed Alnumay, Sultan AlRashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M Saiful Bari, and Haidar Khan. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. In Proceedings of the 62nd Annual Meeting of the Association ...

  4. [4]

    A scenario-based design pack for exploring multimodal human–genai relations

    Josh Andres, Chris Danta, Andrea Bianchi, Sahar Farzanfar, Gloria Milena Fernandez-Nieto, Alexa Becker, Tara Capel, Frances Liddell, Shelby Hagemann, Ned Cooper, et al. A scenario-based design pack for exploring multimodal human–genai relations. InProceedings of the 27th International Conference on Multimodal Interaction, pages 145–154, 2025

  5. [5]

    Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024

  6. [6]

    On the dangers of stochastic parrots: Can language models be too big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY , USA, 2021. Association for Computing Machinery

  7. [7]

    Unsupervised feature learning and deep learning: A review and new perspectives

    Yoshua Bengio, Aaron C Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives.CoRR, abs/1206.5538, 1(2665):2012, 2012

  8. [8]

    Seegull multilingual: a dataset of geo-culturally situated stereotypes

    Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, and Sunipa Dev. Seegull multilingual: a dataset of geo-culturally situated stereotypes. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 842–854, 2024

  9. [9]

    Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals

    Kathrin Blagec, Jakob Kraiger, Wolfgang Frühwirt, and Matthias Samwald. Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals.Journal of Biomedical Informatics, 137:104274, 2023

  10. [10]

    Automatic extraction of metaphoric analogies from literary texts: Task formulation, dataset construction, and evaluation

    Joanne Boisson, Zara Siddique, Hsuvas Borkakoty, Dimosthenis Antypas, Luis Espinosa Anke, and Jose Camacho-Collados. Automatic extraction of metaphoric analogies from literary texts: Task formulation, dataset construction, and evaluation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6692–6704, 2025

  11. [11]

    Man is to computer programmer as woman is to homemaker? Debiasing word embeddings

    Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.Advances in Neural Information Processing Systems, 29, 2016

  12. [12]

    Machine culture

    Levin Brinkmann, Fabian Baumann, Jean-François Bonnefon, Maxime Derex, Thomas F Müller, Anne-Marie Nussberger, Agnieszka Czaplicka, Alberto Acerbi, Thomas L Griffiths, Joseph Henrich, et al. Machine culture. Nature Human Behaviour, 7(11):1855–1868, 2023

  13. [13]

    Rethink reporting of evaluation results in AI

    Ryan Burnell, Wout Schellaert, John Burden, Tomer D Ullman, Fernando Martinez-Plumed, Joshua B Tenenbaum, Danaja Rutar, Lucy G Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, et al. Rethink reporting of evaluation results in AI.Science, 380(6641):136–138, 2023

  14. [14]

    Assessing cross- cultural alignment between ChatGPT and human societies: An empirical study

    Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. Assessing cross- cultural alignment between ChatGPT and human societies: An empirical study. InProceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 53–67, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics

  15. [15]

    Hermeneutics: Facts and interpretation in the age of information

    John D Caputo.Hermeneutics: Facts and interpretation in the age of information. Penguin UK, 2018

  16. [16]

    Explorers of unknown planets

    Baptiste Caramiaux and Sarah Fdili Alaoui. “Explorers of unknown planets”: Practices and politics of artificial intelligence in visual arts. Proc. ACM Hum.-Comput. Interact., 6(CSCW2), November 2022

  17. [17]

    Art or artifice? Large language models and the false promise of creativity

    Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? Large language models and the false promise of creativity. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery

  18. [18]

    A survey on evaluation of large language models

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models.ACM Trans. Intell. Syst. Technol., 15(3), March 2024

  19. [19]

    Unleashing the potential of prompt engineering for large language models

    Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. Unleashing the potential of prompt engineering for large language models.Patterns

  20. [20]

    A computational framework for behavioral assessment of LLM therapists

    Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, and Tim Althoff. A computational framework for behavioral assessment of LLM therapists.arXiv preprint arXiv:2401.00820, 2024

  21. [21]

    Building machines that learn and think with people

    Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, et al. Building machines that learn and think with people.Nature Human Behaviour, 8(10):1851–1863, 2024

  22. [22]

    From driverless dilemmas to more practical commonsense tests for automated vehicles

    Julian De Freitas, Andrea Censi, Bryant Walker Smith, Luigi Di Lillo, Sam E Anthony, and Emilio Frazzoli. From driverless dilemmas to more practical commonsense tests for automated vehicles. Proceedings of the National Academy of Sciences, 118(11):e2010202118, 2021

  23. [23]

    Bringing the people back in: Contesting benchmark machine learning datasets

    Remi Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, and Morgan Klaus Scheuerman. Bringing the people back in: Contesting benchmark machine learning datasets.arXiv preprint arXiv:2007.07399, 2020

  24. [24]

    An archival perspective on pretraining data

    Meera A Desai, Irene V Pasquetto, Abigail Z Jacobs, and Dallas Card. An archival perspective on pretraining data.Patterns, 5(4), 2024

  25. [25]

    A taxonomy of linguistic expressions that contribute to anthropomorphism of language technologies

    Alicia DeVrio, Myra Cheng, Lisa Egede, Alexandra Olteanu, and Su Lin Blodgett. A taxonomy of linguistic expressions that contribute to anthropomorphism of language technologies. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2025

  26. [26]

    Introduction to the human sciences

    Wilhelm Dilthey.Introduction to the human sciences, volume 1. Princeton University Press, 1989

  27. [27]

    Critical digital humanities: The search for a methodology

    James E Dobson.Critical digital humanities: The search for a methodology. University of Illinois Press, 2019

  28. [28]

    Vector hermeneutics: On the interpretation of vector space models of text

    James E Dobson. Vector hermeneutics: On the interpretation of vector space models of text.Digital Scholarship in the Humanities, 37(1):81–93, 2022

  29. [29]

    Towards a rigorous science of interpretable machine learning

    Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning.stat, 1050:2, 2017

  30. [30]

    Relational norms for human-AI cooperation

    Brian D Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, et al. Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025

  31. [31]

    Seven Types of Ambiguity

    William Empson. Seven Types of Ambiguity. 1930

  32. [32]

    Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation

    Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation.arXiv preprint arXiv:2502.06559, 2025

  33. [33]

    How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings

    Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China, November 2019...

  34. [34]

    Utility is in the eye of the user: A critique of NLP leaderboards

    Kawin Ethayarajh and Dan Jurafsky. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, Online, November 2020. Association for Computational Linguistics

  35. [35]

    Large AI models are cultural and social technologies

    Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans. Large AI models are cultural and social technologies. Science, 387(6739):1153–1156, 2025

  36. [36]

    Entanglement HCI the next wave?

    Christopher Frauenberger. Entanglement HCI the next wave?ACM Trans. Comput.-Hum. Interact., 27(1), November 2019

  37. [37]

    Perspectivist approaches to natural language processing: A survey

    Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, and Davide Bernardi. Perspectivist approaches to natural language processing: A survey. Language Resources and Evaluation, pages 1–28, 2024

  38. [38]

    Truth and method

    Hans-Georg Gadamer. Truth and method. 1960

  39. [39]

    Ambiguity as a resource for design

    William W. Gaver, Jacob Beaver, and Steve Benford. Ambiguity as a resource for design. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, page 233–240, New York, NY , USA, 2003. Association for Computing Machinery

  40. [40]

    How culture shapes what people want from AI

    Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. How culture shapes what people want from AI. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery

  41. [41]

    The interpretation of cultures

    Clifford Geertz.The interpretation of cultures. Basic Books, 1973

  42. [42]

    Representation: Cultural representations and signifying practices

    S Hall. Representation: Cultural representations and signifying practices.Culture, 1997

  43. [43]

    Situated knowledges: The science question in feminism and the privilege of partial perspective

    Donna Haraway. Situated knowledges: The science question in feminism and the privilege of partial perspective.Feminist Studies, 14(3):575–599, 1988

  44. [44]

    Being and time

    Martin Heidegger. Being and time. 1927

  45. [45]

    Doing AI differently: Rethinking the foundations of AI via the humanities

    Drew Hemment, Cody Kommers, and colleagues. Doing AI differently: Rethinking the foundations of AI via the humanities. Technical report, London: The Alan Turing Institute, 2025

  46. [46]

    Experiential AI: Between arts and explainable AI

    Drew Hemment, Dave Murray-Rust, Vaishak Belle, Ruth Aylett, Matjaz Vidmar, and Frank Broz. Experiential AI: Between arts and explainable AI.Leonardo, 57(3):298–306, 2024

  47. [47]

    Cultural collapse: Toward a generative formalism for ai cultural production

    Ryan Heuser. Cultural collapse: Toward a generative formalism for ai cultural production.Anthology of Computers and the Humanities, 3:575–588, 2025

  48. [48]

    Multi-turn evaluation of anthropomorphic behaviours in large language models

    Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R McKee, Verena Rieser, Murray Shanahan, and Laura Weidinger. Multi-turn evaluation of anthropomorphic behaviours in large language models.arXiv preprint arXiv:2502.07077, 2025

  49. [49]

    Towards interactive evaluations for interaction harms in human-ai systems

    Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. Towards interactive evaluations for interaction harms in human-ai systems. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1302–1310, 2025

  50. [50]

    An introduction to statistical learning: with applications in R

    Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.An introduction to statistical learning: with applications in R, volume 103. Springer, 2013

  51. [51]

    Dead rats, dopamine, performance metrics, and peacock tails: Proxy failure is an inherent risk in goal-oriented systems

    Yohan J John, Leigh Caldwell, Dakota E McCoy, and Oliver Braganza. Dead rats, dopamine, performance metrics, and peacock tails: Proxy failure is an inherent risk in goal-oriented systems.Behavioral and Brain Sciences, 47:e67, 2024

  52. [52]

    AI agents that matter

    Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter.arXiv preprint arXiv:2407.01502, 2024

  53. [53]

    Provocations from the humanities for generative AI research

    Lauren Klein, Meredith Martin, André Brock, Maria Antoniak, Melanie Walsh, Jessica Marie Johnson, Lauren Tilton, and David Mimno. Provocations from the humanities for generative AI research.arXiv preprint arXiv:2502.19190, 2025

  54. [54]

    From protoscience to epistemic monoculture: How benchmarking set the stage for the deep learning revolution

    Bernard J Koch and David Peterson. From protoscience to epistemic monoculture: How benchmarking set the stage for the deep learning revolution.arXiv preprint arXiv:2404.06647, 2024

  55. [55]

    Sense-making, cultural scripts, and the inferential basis of meaningful experience

    Cody Kommers and Simon DeDeo. Sense-making, cultural scripts, and the inferential basis of meaningful experience. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025

  56. [56]

    Why slop matters

    Cody Kommers, Eamon Duede, Julia Gordon, Ari Holtzman, Tess McNulty, Spencer Stewart, Lindsay Thomas, Richard Jean So, and Hoyt Long. Why slop matters.ACM AI Letters, 2025

  57. [57]

    Meaning is not a metric: Using LLMs to make cultural context legible at scale

    Cody Kommers, Drew Hemment, Maria Antoniak, Joel Z Leibo, Hoyt Long, Emily Robinson, and Adam Sobey. Meaning is not a metric: Using LLMs to make cultural context legible at scale. arXiv preprint arXiv:2505.23785, 2025

  58. [58]

    The geometry of culture: Analyzing the meanings of class through word embeddings

    Austin C Kozlowski, Matt Taddy, and James A Evans. The geometry of culture: Analyzing the meanings of class through word embeddings.American Sociological Review, 84(5):905–949, 2019

  59. [59]

    AI safety on whose terms?

    Seth Lazar and Alondra Nelson. Ai safety on whose terms?Science, 381(6654):138–138, 2023

  60. [60]

    Joel Z Leibo, Alexander Sasha Vezhnevets, Manfred Diaz, John P Agapiou, William A Cunningham, Peter Sunehag, Julia Haas, Raphael Koster, Edgar A Duéñez-Guzmán, William S Isaac, et al. A theory of appropriateness with applications to generative artificial intelligence. arXiv preprint arXiv:2412.19010, 2024

  61. [61]

    Sanford Levinson and Steven Mailloux. Interpreting law and literature: A hermeneutic reader. Northwestern University Press, 1988

  62. [62]

    Ming Li, Jiuhai Chen, Lichang Chen, and Tianyi Zhou. Can LLMs speak for diverse people? Tuning LLMs via debate to generate controllable controversial statements. arXiv preprint arXiv:2402.10614, 2024

  63. [63]

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2022

  64. [64]

    Q Vera Liao and Ziang Xiao. Rethinking model evaluation as narrowing the socio-technical gap. arXiv preprint arXiv:2306.03100, 2023

  65. [65]

    Ryan Lowe, Joe Edelman, Tan Zhi-Xuan, Oliver Klingefjord, Ellie Hain, Vincent Wang, Atrisha Sarkar, Michiel A Bakker, Fazl Barez, Matija Franklin, et al. Full-stack alignment: Co-aligning AI and institutions with thicker models of value. In 2nd Workshop on Models of Human Feedback for AI Alignment, 2025

  66. [66]

    Chaitanya Malaviya, Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, and Kyle Lo. Contextualized evaluations: Judging language model responses to underspecified queries. Transactions of the Association for Computational Linguistics, 13:878–900, 2025

  67. [67]

    Guillermo Marco, Julio Gonzalo, and Víctor Fresno. The reader is the metric: How textual features and reader profiles explain conflicting evaluations of AI creative writing. arXiv preprint arXiv:2506.03310, 2025

  68. [68]

    Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. IEEE Transactions on Artificial Intelligence, 2025

  69. [69]

    Lisa Messeri and Molly J Crockett. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002):49–58, 2024

  70. [70]

    Daniela Mihai and Jonathon Hare. Learning to draw: Emergent communication through sketching. Advances in Neural Information Processing Systems, 34:7153–7166, 2021

  71. [71]

    Rada Mihalcea, Oana Ignat, Longju Bai, Angana Borah, Luis Chiruzzo, Zhijing Jin, Claude Kwizera, Joan Nwatu, Soujanya Poria, and Thamar Solorio. Why AI is weird and shouldn't be this way: Towards AI for everyone, with everyone, by everyone. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28657–28670, 2025

  72. [72]

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 2013

  73. [73]

    Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? A call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12:933–949, 2024

  74. [74]

    John W Mohr, Robin Wagner-Pacifici, and Ronald L Breiger. Toward a computational hermeneutics. Big Data & Society, 2(2):2053951715613809, 2015

  75. [75]

    Tim Murray-Browne and Panagiotis Tigas. Emergent interfaces: Vague, complex, bespoke and embodied interaction between humans and computers. Applied Sciences, 11(18):8531, 2021

  76. [76]

    Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):1–69, 2009

  77. [77]

    Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd Van Steenkiste, Yash Goyal, Karolina Stańczak, and Aishwarya Agrawal. CulturalFrames: Assessing cultural expectation alignment in text-to-image models and evaluation metrics. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20918–20953, 2025

  78. [78]

    Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, 2022

  79. [79]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  80. [80]

    Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics

Showing first 80 references.