Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems
Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3
The pith
Generative AI evaluation must move from static benchmarks to tracing how models, users, and institutions co-construct values over time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative AI must be evaluated as pluralist sociotechnical systems rather than isolated predictors or normative targets. The central mechanism is the Machine-Society-Human (MaSH) Loops framework, which traces how models, users, and institutions recursively co-construct meaning and values through interaction. Evaluation therefore shifts from scoring outputs to examining the enactive processes in which values are made visible and enacted. This descriptive stance is demonstrated through the World Values Benchmark, which uses structured prompts and anchor-aware scoring drawn from World Values Survey data, and through empirical cases of value drift and real-estate applications. The thesis closes by positioning evaluation itself as a site of governance, shaping how AI systems are understood, deployed, and trusted.
What carries the argument
MaSH Loops, a descriptive framework that follows recursive co-construction of meaning and values among models, users, and institutions.
If this is right
- Evaluation becomes an ongoing examination of interaction loops instead of one-time output judgments.
- The World Values Benchmark supplies a distributional method that anchors AI responses to empirical survey data rather than researcher-defined ideals.
- Case studies reveal measurable value drift in early large models and show how real-world deployments embed institutional values.
- Prompting and evaluation are treated as constitutive interventions that shape what systems are understood to be.
- Evaluation itself is positioned as a site of governance that influences deployment and public trust.
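The distributional idea behind the World Values Benchmark's anchor-aware scoring can be sketched as follows. This is a minimal illustration, not the thesis's actual method: the survey item, option labels, anchor frequencies, and the choice of Jensen-Shannon divergence are all assumptions made here for concreteness.

```python
import math
from collections import Counter

def response_distribution(responses, options):
    """Normalise counts of categorical responses over a fixed option set."""
    counts = Counter(responses)
    total = sum(counts[o] for o in options) or 1
    return [counts[o] / total for o in options]

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                          for ai, bi in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical 4-point WVS-style item; anchor frequencies are
# illustrative placeholders, not real World Values Survey data.
options = ["very important", "rather important",
           "not very important", "not at all important"]
survey_anchor = [0.45, 0.35, 0.15, 0.05]
model_responses = ["very important"] * 8 + ["rather important"] * 2

model_dist = response_distribution(model_responses, options)
score = js_divergence(model_dist, survey_anchor)  # lower = closer to the anchor
```

The point of the sketch is the contrast with prescriptive scoring: the model is compared against an empirical response distribution rather than against a single researcher-defined "correct" answer, so divergence from the anchor measures distance from observed human pluralism, not deviation from an ideal.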
Where Pith is reading between the lines
- The same loop-tracing approach could be tested on other generative domains such as image or code models to check whether value co-construction patterns generalize.
- Regulatory bodies might adopt MaSH-style reporting requirements that require disclosure of which user and institutional loops were observed during evaluation.
- If the framework scales, it suggests that participatory design sessions could be integrated directly into benchmark construction to surface previously hidden value conflicts.
Load-bearing premise
That conventional functionalist and prescriptive benchmarks necessarily obscure sociotechnical processes and reify narrow cultural views, while MaSH Loops can be applied in practice without introducing its own unexamined assumptions.
What would settle it
A controlled comparison showing that static benchmarks produce value distributions matching pluralist survey data across diverse user groups without any process tracing or recursive analysis.
Figures
Original abstract
In measurement theory, instruments do not simply record reality; they help constitute what is observed. The same holds for generative AI evaluation: benchmarks do not just measure, they shape what models appear to be. Functionalist benchmarks treat models as isolated predictors, while prescriptive approaches assess what systems ought to be. Both obscure the sociotechnical processes through which meaning and values are enacted, risking the reification of narrow cultural perspectives in pluralist contexts. This thesis advances a descriptive alternative. It argues that generative AI must be evaluated as a pluralist sociotechnical system and develops Machine-Society-Human (MaSH) Loops, a framework for tracing how models, users, and institutions recursively co-construct meaning and values. Evaluation shifts from judging outputs to examining how values are enacted in interaction. Three contributions follow. Conceptually, MaSH Loops reframes evaluation as recursive, enactive process. Methodologically, the World Values Benchmark introduces a distributional approach grounded in World Values Survey data, structured prompt sets, and anchor-aware scoring. Empirically, the thesis demonstrates these through two cases: value drift in early GPT-3 and sociotechnical evaluation in real estate. A final chapter draws on participatory realism to argue that prompting and evaluation are constitutive interventions, not neutral observations. The thesis argues that static benchmarks are insufficient for generative AI. Responsible evaluation requires pluralist, process-oriented frameworks that make visible whose values are enacted. Evaluation is therefore a site of governance, shaping how AI systems are understood, deployed, and trusted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that generative AI benchmarks do not neutrally measure model behaviors and values but constitutively shape them. It critiques functionalist (isolated predictor) and prescriptive (ought-to-be) approaches for obscuring recursive sociotechnical value enactment and risking the reification of narrow cultural perspectives. As a descriptive alternative, it advances the Machine-Society-Human (MaSH) Loops framework for tracing the co-construction of meaning among models, users, and institutions; introduces the World Values Benchmark, a distributional methodology pairing World Values Survey data with structured prompts and anchor-aware scoring; demonstrates the approach in two cases (value drift in early GPT-3 and a real-estate sociotechnical evaluation); and concludes that evaluation constitutes a site of governance.
Significance. If the MaSH Loops framework can be operationalized and shown to surface value enactments not captured by existing methods, the work would meaningfully advance conceptual foundations in AI evaluation and responsible AI governance by reframing measurement as an enactive, pluralist process rather than a neutral observation. The explicit linkage of evaluation to governance and the distributional benchmark grounded in cross-cultural survey data represent constructive contributions that could inform both academic and policy discussions on how benchmarks embed and propagate values.
Major comments (2)
- [Methodological contribution / World Values Benchmark] The central claim that MaSH Loops provides a non-circular descriptive alternative rests on the assertion that it traces rather than prescribes values, yet the World Values Benchmark methodology (described in the methodological contribution) selects specific survey items and anchor-aware scoring rules; without an explicit argument or sensitivity analysis showing these choices do not embed the very cultural perspectives the framework critiques, the distinction from prescriptive approaches remains under-supported.
- [Empirical demonstrations / case studies] The empirical demonstrations (value drift in early GPT-3 and real-estate case) are presented as illustrations of the framework, but the manuscript supplies no quantitative comparison of MaSH Loop outputs against standard static benchmarks on the same prompts or tasks; this weakens the load-bearing assertion that static benchmarks are insufficient, as readers cannot assess whether the recursive tracing reveals materially different or additional value enactments.
Minor comments (2)
- [Abstract] The abstract and introduction use dense prose when enumerating the three contributions; a bulleted list or numbered enumeration would improve scannability without altering content.
- [Final chapter] The term 'participatory realism' is invoked in the final chapter without a reference or brief definition; adding a citation or one-sentence gloss would clarify the governance argument for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which identify key opportunities to strengthen the distinction between our descriptive framework and existing approaches, as well as the empirical grounding of our claims. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
Referee: The central claim that MaSH Loops provides a non-circular descriptive alternative rests on the assertion that it traces rather than prescribes values, yet the World Values Benchmark methodology (described in the methodological contribution) selects specific survey items and anchor-aware scoring rules; without an explicit argument or sensitivity analysis showing these choices do not embed the very cultural perspectives the framework critiques, the distinction from prescriptive approaches remains under-supported.
Authors: We acknowledge that the manuscript would be strengthened by a more explicit defense of the benchmark's descriptive character. In revision, we will add a dedicated subsection to the methodological contribution that articulates the rationale for item selection and scoring: the World Values Survey items are chosen for their established cross-cultural empirical grounding rather than normative prescription, and anchor-aware scoring is intended to surface distributional variance without imposing target values. We will also report a sensitivity analysis that varies item subsets and scoring parameters to demonstrate robustness of the identified value enactments. These additions will more clearly differentiate the approach from prescriptive methods. revision: yes
Referee: The empirical demonstrations (value drift in early GPT-3 and real-estate case) are presented as illustrations of the framework, but the manuscript supplies no quantitative comparison of MaSH Loop outputs against standard static benchmarks on the same prompts or tasks; this weakens the load-bearing assertion that static benchmarks are insufficient, as readers cannot assess whether the recursive tracing reveals materially different or additional value enactments.
Authors: The case studies were conceived as qualitative illustrations of recursive tracing rather than comparative evaluations. We agree that direct quantitative comparisons would better support the claim that static benchmarks are insufficient. In the revised manuscript, we will augment the empirical demonstrations with side-by-side analyses: applying both MaSH Loop tracing and representative static benchmarks (e.g., fixed-prompt value surveys) to identical prompt sets from the GPT-3 drift case and the real-estate evaluation. We will report quantitative differences in detected value distributions and the additional sociotechnical factors surfaced by the recursive method. revision: yes
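The side-by-side drift comparison the authors promise could be operationalized along these lines. This is a hedged sketch only: the item names, distributions, total-variation metric, and threshold are hypothetical, not the thesis's actual data or procedure.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def flag_drift(dists_v1, dists_v2, threshold=0.2):
    """Flag items whose response distributions diverge between two
    model snapshots by more than the threshold.

    dists_v1 / dists_v2 map item ids to normalised distributions
    over the same fixed option set.
    """
    drift = {item: total_variation(dists_v1[item], dists_v2[item])
             for item in dists_v1}
    return {item: d for item, d in drift.items() if d > threshold}

# Illustrative distributions for two snapshots of the same model
# (placeholder numbers, not real GPT-3 measurements).
v_early = {"importance_of_family": [0.60, 0.30, 0.10],
           "trust_in_strangers":   [0.20, 0.50, 0.30]}
v_late  = {"importance_of_family": [0.62, 0.28, 0.10],
           "trust_in_strangers":   [0.55, 0.30, 0.15]}

drifted = flag_drift(v_early, v_late)  # flags "trust_in_strangers" only
```

Running the same per-item comparison with a static benchmark's single-prompt scores alongside the loop-traced distributions would give exactly the quantitative contrast the referee asks for: which items drift, by how much, and whether the recursive method surfaces shifts the static scores miss.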
Circularity Check
No significant circularity identified
full rationale
The paper advances a conceptual reframing of generative AI evaluation as a constitutive sociotechnical process, drawing on measurement theory to critique functionalist and prescriptive benchmarks before introducing the MaSH Loops framework as a descriptive alternative for tracing recursive value enactment. No load-bearing steps reduce by construction to self-defined inputs, fitted parameters renamed as predictions, or self-citation chains; the World Values Benchmark methodology and two case studies supply independent empirical content, and the governance conclusion follows directly from the stated premises without requiring unverified uniqueness theorems or smuggled ansatzes. The derivation remains self-contained as a process-oriented proposal.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Instruments do not simply record reality; they help constitute what is observed.
- Domain assumption: Generative AI evaluation involves recursive sociotechnical co-construction of meaning and values.
Invented entities (2)
- MaSH Loops — no independent evidence
- World Values Benchmark — no independent evidence