Computational Hermeneutics: Evaluating generative AI as a cultural technology
Pith reviewed 2026-05-13 23:44 UTC · model grok-4.3
The pith
Generative AI systems function as context machines that address interpretive challenges of situatedness, plurality, and ambiguity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We argue that GenAI systems function as context machines that must inherently address three interpretive challenges: situatedness, where meaning only emerges in context; plurality, where multiple valid interpretations coexist; and ambiguity, where interpretations naturally conflict. We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation: that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.
What carries the argument
Computational hermeneutics as a framework that treats GenAI systems as context machines required to handle situatedness, plurality, and ambiguity in meaning.
If this is right
- Evaluation benchmarks must be iterative processes rather than single fixed tests.
- Assessment requires direct inclusion of human participants in addition to automated measures.
- Metrics need to capture cultural context and interpretive fit instead of isolated output accuracy.
- System design should prioritize addressing interpretive challenges over optimizing for standardized questions.
- The overall paradigm for AI evaluation shifts from accuracy to contextual meaning.
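The three principles above can be sketched as a minimal iterative evaluation loop. This is an illustrative reading under stated assumptions, not an implementation the paper supplies; all names here (`Item`, `evaluate_round`, the toy model and annotator callables) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """One benchmark prompt, accumulating interpretations across rounds."""
    prompt: str
    interpretations: set = field(default_factory=set)

def evaluate_round(items, model, annotators):
    """One iteration of a hermeneutic benchmark: the model answers,
    human annotators contribute readings in context, and each item
    carries its growing set of valid interpretations into the next round."""
    results = []
    for item in items:
        answer = model(item.prompt)
        # people-inclusive: judgments come from human annotators, not string match
        readings = {judge(item.prompt, answer) for judge in annotators}
        item.interpretations |= readings
        # context measure: plurality = distinct valid readings elicited so far
        results.append((item.prompt, answer, len(item.interpretations)))
    return results
```

Running successive rounds with a broader annotator pool grows the plurality count, which is the sense in which such a benchmark is iterative rather than one-off.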
Where Pith is reading between the lines
- This view could guide training data curation to explicitly embed varied cultural contexts for better ambiguity handling.
- It suggests potential links to existing work on human-AI collaboration in interpretive fields like history or literature.
- Developers might test the framework by applying it to specific domains such as creative writing or historical analysis tasks.
- Over time it could influence how regulators assess AI systems deployed in cultural or media production.
Load-bearing premise
Hermeneutic theory from the humanities can be straightforwardly applied to provide a computational account of GenAI operation and evaluation without requiring additional empirical validation or adaptation.
What would settle it
A controlled study comparing GenAI performance on cultural tasks using hermeneutic evaluation principles versus standard accuracy metrics, where the hermeneutic approach shows no measurable improvement in handling context or ambiguity.
Original abstract
Generative AI systems are increasingly recognized as cultural technologies, yet current evaluation frameworks often treat culture as a variable to be measured rather than fundamental to the system's operation. Drawing on hermeneutic theory from the humanities, we argue that GenAI systems function as "context machines" that must inherently address three interpretive challenges: situatedness (meaning only emerges in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations naturally conflict). We present computational hermeneutics as an emerging framework offering an interpretive account of what GenAI systems do, and how they might do it better. We offer three principles for hermeneutic evaluation -- that benchmarks should be iterative, not one-off; include people, not just machines; and measure cultural context, not just model output. This perspective offers a nascent paradigm for designing and evaluating contemporary AI systems: shifting from standardized questions about accuracy to contextual ones about meaning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that generative AI systems function as 'context machines' that inherently address three interpretive challenges drawn from hermeneutic theory: situatedness (meaning emerges only in context), plurality (multiple valid interpretations coexist), and ambiguity (interpretations conflict). It proposes 'computational hermeneutics' as an emerging framework for an interpretive account of GenAI operation and evaluation, along with three principles: benchmarks should be iterative rather than one-off, include people, not just machines, and measure cultural context, not just model output.
Significance. If the interpretive account holds, the paper offers a potentially significant shift in how GenAI is conceptualized and evaluated, moving from standardized accuracy metrics toward contextual questions about meaning and culture. This could influence design paradigms if the framework is later operationalized, and the explicit integration of humanities-derived hermeneutic theory into AI assessment is a clear strength of the conceptual contribution.
major comments (2)
- [Introduction and framework presentation] The section introducing the three interpretive challenges and the 'context machines' framing: the manuscript asserts that GenAI systems inherently address situatedness, plurality, and ambiguity but supplies no derivation or technical mapping showing why standard next-token prediction fails to capture them or how these challenges translate into model architectures, loss functions, or metrics distinct from existing context-window or retrieval-augmented methods.
- [Principles for hermeneutic evaluation] The section offering the three principles for hermeneutic evaluation: the principles (iterative benchmarks, people-inclusive, context-measuring) are stated at a high level without operationalization of 'cultural context' as a measurable quantity or any concrete examples of how they would alter benchmark design or model training, rendering the framework non-computable in its current form.
minor comments (1)
- The manuscript would benefit from additional citations to specific hermeneutic theorists (e.g., Gadamer or Ricoeur) to make the theoretical grounding more traceable.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential value of integrating hermeneutic theory into GenAI evaluation. We agree that the manuscript would benefit from greater clarity on the conceptual-to-technical linkages and from more concrete illustrations of the proposed principles. Below we respond point by point and indicate the revisions we will make.
Point-by-point responses
-
Referee: The section introducing the three interpretive challenges and the 'context machines' framing: the manuscript asserts that GenAI systems inherently address situatedness, plurality, and ambiguity but supplies no derivation or technical mapping showing why standard next-token prediction fails to capture them or how these challenges translate into model architectures, loss functions, or metrics distinct from existing context-window or retrieval-augmented methods.
Authors: We acknowledge that the current presentation remains at the level of interpretive framing rather than supplying explicit technical derivations. The manuscript positions computational hermeneutics as a conceptual lens rather than an immediate architectural proposal; therefore no detailed mapping to loss functions or novel metrics was included. To address the concern, we will revise the introduction to add a short subsection that (a) contrasts next-token prediction with the three challenges by reference to known limitations of fixed context windows, (b) illustrates how retrieval-augmented generation partially addresses plurality but leaves situatedness and ambiguity under-specified, and (c) sketches, at a conceptual level, how an iterative human-in-the-loop protocol could surface distinct evaluation signals. These additions will clarify the intended relationship without claiming new technical results. revision: partial
-
Referee: The section offering the three principles for hermeneutic evaluation: the principles (iterative benchmarks, people-inclusive, context-measuring) are stated at a high level without operationalization of 'cultural context' as a measurable quantity or any concrete examples of how they would alter benchmark design or model training, rendering the framework non-computable in its current form.
Authors: We accept that the principles are currently stated at a programmatic level and that operational definitions and examples are required to demonstrate feasibility. In the revised manuscript we will (1) define 'cultural context' operationally via two proxy measures—inter-annotator agreement on culturally specific references and the number of distinct valid interpretations elicited from diverse human evaluators—and (2) supply two worked examples: an adaptation of a standard multiple-choice benchmark that inserts iterative human clarification rounds, and a training-time objective that augments cross-entropy loss with an ambiguity-resolution term derived from multi-annotator disagreement. These changes will render the framework more actionable while preserving its theoretical grounding. revision: yes
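The proxy measure and training-time objective promised in this response admit a simple sketch. The following is our illustrative reading, not code from the manuscript: the normalized-entropy disagreement measure and the λ-weighted mismatch penalty are assumptions about how an 'ambiguity-resolution term' derived from multi-annotator disagreement might be instantiated.

```python
import math
from collections import Counter

def disagreement(labels):
    """Normalized entropy of annotator labels:
    0.0 = unanimous, 1.0 = maximally split."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = len(labels)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

def augmented_loss(cross_entropy, model_entropy, labels, lam=0.5):
    """Cross-entropy plus an ambiguity-resolution term: the model's
    predictive uncertainty is pushed toward the level of disagreement
    actually observed among human annotators."""
    return cross_entropy + lam * abs(model_entropy - disagreement(labels))
```

On an unambiguous item (unanimous annotators) the extra term penalizes residual model uncertainty; on a genuinely contested item it penalizes false confidence.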
Circularity Check
No significant circularity: the interpretive framework draws on external hermeneutic theory
Full rationale
The manuscript proposes computational hermeneutics as an interpretive lens for GenAI, framing systems as context machines that address situatedness, plurality, and ambiguity, then suggests three evaluation principles (iterative benchmarks, people-inclusive, context-measuring). These claims rest on direct citation of established humanities hermeneutic theory rather than any internal derivation, equations, fitted parameters, or self-referential definitions. No load-bearing step reduces a result to its own inputs by construction, no predictions are statistically forced from subsets of data, and no uniqueness theorems or ansatzes are smuggled via self-citation. The argument is self-contained as a conceptual extension of external sources, with no mathematical or empirical loop that would trigger circularity under the specified patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Hermeneutic theory supplies a valid interpretive account of how GenAI systems function as context machines
- domain assumption: Current evaluation frameworks treat culture as a variable to be measured rather than as fundamental to system operation
invented entities (1)
-
computational hermeneutics
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Century: A framework and dataset for evaluating historical contextualisation of sensitive images
Canfer Akbulut, Kevin Robinson, Maribeth Rauh, Isabela Albuquerque, Olivia Wiles, Laura Weidinger, Verena Rieser, Yana Hasson, Nahema Marchal, Iason Gabriel, et al. Century: A framework and dataset for evaluating historical contextualisation of sensitive images. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[2]
All too human? mapping and mitigating the risk from anthropomorphic ai
Canfer Akbulut, Laura Weidinger, Arianna Manzini, Iason Gabriel, and Verena Rieser. All too human? mapping and mitigating the risk from anthropomorphic ai. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pages 13–26, 2024
work page 2024
-
[3]
When benchmarks are targets: Revealing the sensitivity of large language model leaderboards
Norah Alzahrani, Hisham Alyahya, Yazeed Alnumay, Sultan AlRashed, Shaykhah Alsubaie, Yousef Almushayqih, Faisal Mirza, Nouf Alotaibi, Nora Al-Twairesh, Areeb Alowisheq, M Saiful Bari, and Haidar Khan. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. In Proceedings of the 62nd Annual Meeting of the Association ...
work page 2024
-
[4]
A scenario-based design pack for exploring multimodal human–genai relations
Josh Andres, Chris Danta, Andrea Bianchi, Sahar Farzanfar, Gloria Milena Fernandez-Nieto, Alexa Becker, Tara Capel, Frances Liddell, Shelby Hagemann, Ned Cooper, et al. A scenario-based design pack for exploring multimodal human–genai relations. InProceedings of the 27th International Conference on Multimodal Interaction, pages 145–154, 2025
work page 2025
-
[5]
Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues.arXiv preprint arXiv:2402.14762, 2024
-
[6]
On the dangers of stochastic parrots: Can language models be too big?
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 610–623, New York, NY, USA, 2021. Association for Computing Machinery
work page 2021
-
[7]
Yoshua Bengio, Aaron C Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives.CoRR, abs/1206.5538, 1(2665):2012, 2012
-
[8]
Seegull multilingual: a dataset of geo-culturally situated stereotypes
Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, and Sunipa Dev. Seegull multilingual: a dataset of geo-culturally situated stereotypes. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 842–854, 2024
work page 2024
-
[9]
Kathrin Blagec, Jakob Kraiger, Wolfgang Frühwirt, and Matthias Samwald. Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals.Journal of Biomedical Informatics, 137:104274, 2023
work page 2023
-
[10]
Joanne Boisson, Zara Siddique, Hsuvas Borkakoty, Dimosthenis Antypas, Luis Espinosa Anke, and Jose Camacho-Collados. Automatic extraction of metaphoric analogies from literary texts: Task formulation, dataset construction, and evaluation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6692–6704, 2025
work page 2025
-
[11]
Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.Advances in Neural Information Processing Systems, 29, 2016
work page 2016
-
[12]
Machine culture.Nature Human Behaviour, 7(11):1855–1868, 2023
Levin Brinkmann, Fabian Baumann, Jean-François Bonnefon, Maxime Derex, Thomas F Müller, Anne-Marie Nussberger, Agnieszka Czaplicka, Alberto Acerbi, Thomas L Griffiths, Joseph Henrich, et al. Machine culture. Nature Human Behaviour, 7(11):1855–1868, 2023
work page 2023
-
[13]
Rethink reporting of evaluation results in AI.Science, 380(6641):136–138, 2023
Ryan Burnell, Wout Schellaert, John Burden, Tomer D Ullman, Fernando Martinez-Plumed, Joshua B Tenenbaum, Danaja Rutar, Lucy G Cheke, Jascha Sohl-Dickstein, Melanie Mitchell, et al. Rethink reporting of evaluation results in AI.Science, 380(6641):136–138, 2023
work page 2023
-
[14]
Assessing cross-cultural alignment between ChatGPT and human societies: An empirical study
Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. Assessing cross-cultural alignment between ChatGPT and human societies: An empirical study. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 53–67, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics
work page 2023
-
[15]
John D Caputo.Hermeneutics: Facts and interpretation in the age of information. Penguin UK, 2018
work page 2018
-
[16]
Baptiste Caramiaux and Sarah Fdili Alaoui. “Explorers of unknown planets”: Practices and politics of artificial intelligence in visual arts. Proc. ACM Hum.-Comput. Interact., 6(CSCW2), November 2022
work page 2022
-
[17]
Art or artifice? Large language models and the false promise of creativity
Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan, and Chien-Sheng Wu. Art or artifice? Large language models and the false promise of creativity. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery
work page 2024
-
[18]
Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie. A survey on evaluation of large language models.ACM Trans. Intell. Syst. Technol., 15(3), March 2024
work page 2024
-
[19]
Unleashing the potential of prompt engineering for large language models.Patterns
Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. Unleashing the potential of prompt engineering for large language models.Patterns
-
[20]
Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, and Tim Althoff. A computational framework for behavioral assessment of LLM therapists.arXiv preprint arXiv:2401.00820, 2024
-
[21]
Building machines that learn and think with people.Nature Human Behaviour, 8(10):1851–1863, 2024
Katherine M Collins, Ilia Sucholutsky, Umang Bhatt, Kartik Chandra, Lionel Wong, Mina Lee, Cedegao E Zhang, Tan Zhi-Xuan, Mark Ho, Vikash Mansinghka, et al. Building machines that learn and think with people.Nature Human Behaviour, 8(10):1851–1863, 2024
work page 2024
-
[22]
Julian De Freitas, Andrea Censi, Bryant Walker Smith, Luigi Di Lillo, Sam E Anthony, and Emilio Frazzoli. From driverless dilemmas to more practical commonsense tests for automated vehicles. Proceedings of the National Academy of Sciences, 118(11):e2010202118, 2021
work page 2021
-
[23]
Remi Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, Hilary Nicole, and Morgan Klaus Scheuerman. Bringing the people back in: Contesting benchmark machine learning datasets.arXiv preprint arXiv:2007.07399, 2020
-
[24]
An archival perspective on pretraining data.Patterns, 5(4), 2024
Meera A Desai, Irene V Pasquetto, Abigail Z Jacobs, and Dallas Card. An archival perspective on pretraining data.Patterns, 5(4), 2024
work page 2024
-
[25]
A taxonomy of linguistic expressions that contribute to anthropomorphism of language technologies
Alicia DeVrio, Myra Cheng, Lisa Egede, Alexandra Olteanu, and Su Lin Blodgett. A taxonomy of linguistic expressions that contribute to anthropomorphism of language technologies. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2025
work page 2025
-
[26]
Introduction to the human sciences
Wilhelm Dilthey. Introduction to the human sciences, volume 1. Princeton University Press, 1989
work page 1989
-
[27]
Critical digital humanities: The search for a methodology
James E Dobson. Critical digital humanities: The search for a methodology. University of Illinois Press, 2019
work page 2019
-
[28]
James E Dobson. Vector hermeneutics: On the interpretation of vector space models of text.Digital Scholarship in the Humanities, 37(1):81–93, 2022
work page 2022
-
[29]
Towards a rigorous science of interpretable machine learning.stat, 1050:2, 2017
Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning.stat, 1050:2, 2017
work page 2017
-
[30]
Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025
Brian D Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, et al. Relational norms for human-AI cooperation.arXiv preprint arXiv:2502.12102, 2025
-
[31]
William Empson. Seven Types of Ambiguity. 1930
work page 1930
-
[32]
Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, and David Fernandez-Llorca. Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation.arXiv preprint arXiv:2502.06559, 2025
-
[33]
Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China, November 2019...
work page 2019
-
[34]
Utility is in the eye of the user: A critique of NLP leaderboards
Kawin Ethayarajh and Dan Jurafsky. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, Online, November 2020. Association for Computational Linguistics
work page 2020
-
[35]
Large AI models are cultural and social technologies.Science, 387(6739):1153–1156, 2025
Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans. Large AI models are cultural and social technologies. Science, 387(6739):1153–1156, 2025
work page 2025
-
[36]
Entanglement HCI the next wave?
Christopher Frauenberger. Entanglement HCI the next wave? ACM Trans. Comput.-Hum. Interact., 27(1), November 2019
work page 2019
-
[37]
Simona Frenda, Gavin Abercrombie, Valerio Basile, Alessandro Pedrani, Raffaella Panizzon, Alessandra Teresa Cignarella, Cristina Marco, and Davide Bernardi. Perspectivist approaches to natural language processing: A survey. Language Resources and Evaluation, pages 1–28, 2024
work page 2024
-
[38]
Hans-Georg Gadamer.Truth and method. 1960
work page 1960
-
[39]
Ambiguity as a resource for design
William W. Gaver, Jacob Beaver, and Steve Benford. Ambiguity as a resource for design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’03, page 233–240, New York, NY, USA, 2003. Association for Computing Machinery
work page 2003
-
[40]
How culture shapes what people want from AI
Xiao Ge, Chunchen Xu, Daigo Misaki, Hazel Rose Markus, and Jeanne L Tsai. How culture shapes what people want from AI. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI ’24, New York, NY , USA, 2024. Association for Computing Machinery
work page 2024
- [41]
-
[42]
Representation: Cultural representations and signifying practices.Culture, 1997
S Hall. Representation: Cultural representations and signifying practices.Culture, 1997
work page 1997
-
[43]
Donna Haraway. Situated knowledges: The science question in feminism and the privilege of partial perspective.Feminist Studies, 14(3):575–599, 1988
work page 1988
-
[44]
Martin Heidegger.Being and time. 1927
work page 1927
-
[45]
Doing AI differently: Rethinking the foundations of AI via the humanities
Drew Hemment, Cody Kommers, and colleagues. Doing AI differently: Rethinking the foundations of AI via the humanities. Technical report, London: The Alan Turing Institute, 2025
work page 2025
-
[46]
Experiential AI: Between arts and explainable AI.Leonardo, 57(3):298–306, 2024
Drew Hemment, Dave Murray-Rust, Vaishak Belle, Ruth Aylett, Matjaz Vidmar, and Frank Broz. Experiential AI: Between arts and explainable AI.Leonardo, 57(3):298–306, 2024
work page 2024
-
[47]
Ryan Heuser. Cultural collapse: Toward a generative formalism for ai cultural production.Anthology of Computers and the Humanities, 3:575–588, 2025
work page 2025
-
[48]
Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R McKee, Verena Rieser, Murray Shanahan, and Laura Weidinger. Multi-turn evaluation of anthropomorphic behaviours in large language models.arXiv preprint arXiv:2502.07077, 2025
-
[49]
Towards interactive evaluations for interaction harms in human-ai systems
Lujain Ibrahim, Saffron Huang, Lama Ahmad, Umang Bhatt, and Markus Anderljung. Towards interactive evaluations for interaction harms in human-ai systems. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 8, pages 1302–1310, 2025
work page 2025
-
[50]
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.An introduction to statistical learning: with applications in R, volume 103. Springer, 2013
work page 2013
-
[51]
Yohan J John, Leigh Caldwell, Dakota E McCoy, and Oliver Braganza. Dead rats, dopamine, performance metrics, and peacock tails: Proxy failure is an inherent risk in goal-oriented systems.Behavioral and Brain Sciences, 47:e67, 2024
work page 2024
-
[52]
Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter.arXiv preprint arXiv:2407.01502, 2024
-
[53]
Provocations from the humanities for generative AI research.arXiv preprint arXiv:2502.19190, 2025
Lauren Klein, Meredith Martin, André Brock, Maria Antoniak, Melanie Walsh, Jessica Marie Johnson, Lauren Tilton, and David Mimno. Provocations from the humanities for generative AI research.arXiv preprint arXiv:2502.19190, 2025
-
[54]
Bernard J Koch and David Peterson. From protoscience to epistemic monoculture: How benchmarking set the stage for the deep learning revolution.arXiv preprint arXiv:2404.06647, 2024
-
[55]
Sense-making, cultural scripts, and the inferential basis of meaningful experience
Cody Kommers and Simon DeDeo. Sense-making, cultural scripts, and the inferential basis of meaningful experience. InProceedings of the Annual Meeting of the Cognitive Science Society, volume 47, 2025
work page 2025
-
[56]
Why slop matters.ACM AI Letters, 2025
Cody Kommers, Eamon Duede, Julia Gordon, Ari Holtzman, Tess McNulty, Spencer Stewart, Lindsay Thomas, Richard Jean So, and Hoyt Long. Why slop matters.ACM AI Letters, 2025
work page 2025
-
[57]
Cody Kommers, Drew Hemment, Maria Antoniak, Joel Z Leibo, Hoyt Long, Emily Robinson, and Adam Sobey. Meaning is not a metric: Using LLMs to make cultural context legible at scale. arXiv preprint arXiv:2505.23785, 2025
-
[58]
Austin C Kozlowski, Matt Taddy, and James A Evans. The geometry of culture: Analyzing the meanings of class through word embeddings.American Sociological Review, 84(5):905–949, 2019
work page 2019
-
[59]
Ai safety on whose terms?Science, 381(6654):138–138, 2023
Seth Lazar and Alondra Nelson. Ai safety on whose terms?Science, 381(6654):138–138, 2023
work page 2023
-
[60]
Joel Z Leibo, Alexander Sasha Vezhnevets, Manfred Diaz, John P Agapiou, William A Cunningham, Peter Sunehag, Julia Haas, Raphael Koster, Edgar A Duéñez-Guzmán, William S Isaac, et al. A theory of appropriateness with applications to generative artificial intelligence.arXiv preprint arXiv:2412.19010, 2024
-
[61]
Interpreting law and literature: A hermeneutic reader
Sanford Levinson and Steven Mailloux. Interpreting law and literature: A hermeneutic reader. Northwestern University Press, 1988
work page 1988
-
[62]
Ming Li, Jiuhai Chen, Lichang Chen, and Tianyi Zhou. Can llms speak for diverse people? tuning llms via debate to generate controllable controversial statements.arXiv preprint arXiv:2402.10614, 2024
-
[63]
Holistic evaluation of language models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2022
work page 2022
-
[64]
Q Vera Liao and Ziang Xiao. Rethinking model evaluation as narrowing the socio-technical gap.arXiv preprint arXiv:2306.03100, 2023
-
[65]
Full-stack alignment: Co-aligning AI and institutions with thicker models of value
Ryan Lowe, Joe Edelman, Tan Zhi-Xuan, Oliver Klingefjord, Ellie Hain, Vincent Wang, Atrisha Sarkar, Michiel A Bakker, Fazl Barez, Matija Franklin, et al. Full-stack alignment: Co-aligning AI and institutions with thicker models of value. In2nd Workshop on Models of Human Feedback for AI Alignment, 2025
work page 2025
-
[66]
Chaitanya Malaviya, Joseph Chee Chang, Dan Roth, Mohit Iyyer, Mark Yatskar, and Kyle Lo. Contextualized evaluations: Judging language model responses to underspecified queries. Transactions of the Association for Computational Linguistics, 13:878–900, 2025
work page 2025
-
[67]
Guillermo Marco, Julio Gonzalo, and Víctor Fresno. The reader is the metric: How textual features and reader profiles explain conflicting evaluations of AI creative writing.arXiv preprint arXiv:2506.03310, 2025
-
[68]
Timothy R McIntosh, Teo Susnjak, Nalin Arachchilage, Tong Liu, Dan Xu, Paul Watters, and Malka N Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. IEEE Transactions on Artificial Intelligence, 2025
work page 2025
-
[69]
Lisa Messeri and Molly J Crockett. Artificial intelligence and illusions of understanding in scientific research.Nature, 627(8002):49–58, 2024
work page 2024
-
[70]
Learning to draw: Emergent communication through sketching
Daniela Mihai and Jonathon Hare. Learning to draw: Emergent communication through sketching. Advances in Neural Information Processing Systems, 34:7153–7166, 2021
work page 2021
-
[71]
Why ai is weird and shouldn’t be this way: Towards ai for everyone, with everyone, by everyone
Rada Mihalcea, Oana Ignat, Longju Bai, Angana Borah, Luis Chiruzzo, Zhijing Jin, Claude Kwizera, Joan Nwatu, Soujanya Poria, and Thamar Solorio. Why ai is weird and shouldn’t be this way: Towards ai for everyone, with everyone, by everyone. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28657–28670, 2025
work page 2025
-
[72]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality.Advances in Neural Information Processing Systems, 26, 2013
work page 2013
-
[73]
Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. State of what art? A call for multi-prompt LLM evaluation.Transactions of the Association for Computational Linguistics, 12:933–949, 2024
work page 2024
-
[74]
Toward a computational hermeneutics.Big Data & Society, 2(2):2053951715613809, 2015
John W Mohr, Robin Wagner-Pacifici, and Ronald L Breiger. Toward a computational hermeneutics.Big Data & Society, 2(2):2053951715613809, 2015
work page 2015
-
[75]
Tim Murray-Browne and Panagiotis Tigas. Emergent interfaces: Vague, complex, bespoke and embodied interaction between humans and computers.Applied Sciences, 11(18):8531, 2021
work page 2021
-
[76]
Word sense disambiguation: A survey.ACM Computing Surveys (CSUR), 41(2):1–69, 2009
Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):1–69, 2009
work page 2009
-
[77]
Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd Van Steenkiste, Yash Goyal, Karolina Stańczak, and Aishwarya Agrawal. Culturalframes: Assessing cultural expectation alignment in text-to-image models and evaluation metrics. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 20918–20953, 2025
work page 2025
-
[78]
Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence.Nature Communications, 13(1):6793, 2022
work page 2022
-
[79]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022
work page 2022
-
[80]
GloVe: Global vectors for word representation
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics
work page 2014