Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechnical Systems
Pith reviewed 2026-05-10 00:13 UTC · model grok-4.3
The pith
Generative AI evaluation must move from static benchmarks to tracing how models, users, and institutions co-construct values over time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative AI must be evaluated as pluralist sociotechnical systems rather than isolated predictors or normative targets. The central mechanism is the Machine-Society-Human (MaSH) Loops framework, which traces how models, users, and institutions recursively co-construct meaning and values through interaction. Evaluation therefore shifts from scoring outputs to examining the enactive processes in which values are made visible and enacted. This descriptive stance is demonstrated through the World Values Benchmark, which uses structured prompts and anchor-aware scoring drawn from World Values Survey data, and through empirical cases of value drift and real-estate applications. The thesis closes by positioning evaluation itself as a site of governance, shaping how AI systems are understood, deployed, and trusted.
What carries the argument
MaSH Loops, a descriptive framework that follows recursive co-construction of meaning and values among models, users, and institutions.
If this is right
- Evaluation becomes an ongoing examination of interaction loops instead of one-time output judgments.
- The World Values Benchmark supplies a distributional method that anchors AI responses to empirical survey data rather than researcher-defined ideals.
- Case studies reveal measurable value drift in early large models and show how real-world deployments embed institutional values.
- Prompting and evaluation are treated as constitutive interventions that shape what systems are understood to be.
- Evaluation itself is positioned as a site of governance that influences deployment and public trust.
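The distributional idea behind the World Values Benchmark's anchor-aware scoring can be sketched as follows. This is a minimal illustration, not the thesis's actual method: the survey item, option labels, anchor frequencies, and the choice of Jensen-Shannon divergence are all assumptions made here for concreteness.

```python
import math
from collections import Counter

def response_distribution(responses, options):
    """Normalise counts of categorical responses over a fixed option set."""
    counts = Counter(responses)
    total = sum(counts[o] for o in options) or 1
    return [counts[o] / total for o in options]

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                          for ai, bi in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical 4-point WVS-style item; anchor frequencies are
# illustrative placeholders, not real World Values Survey data.
options = ["very important", "rather important",
           "not very important", "not at all important"]
survey_anchor = [0.45, 0.35, 0.15, 0.05]
model_responses = ["very important"] * 8 + ["rather important"] * 2

model_dist = response_distribution(model_responses, options)
score = js_divergence(model_dist, survey_anchor)  # lower = closer to the anchor
```

The point of the sketch is the contrast with prescriptive scoring: the model is compared against an empirical response distribution rather than against a single researcher-defined "correct" answer, so divergence from the anchor measures distance from observed human pluralism, not deviation from an ideal.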
Where Pith is reading between the lines
- The same loop-tracing approach could be tested on other generative domains such as image or code models to check whether value co-construction patterns generalize.
- Regulatory bodies might adopt MaSH-style reporting requirements that require disclosure of which user and institutional loops were observed during evaluation.
- If the framework scales, it suggests that participatory design sessions could be integrated directly into benchmark construction to surface previously hidden value conflicts.
Load-bearing premise
That conventional functionalist and prescriptive benchmarks necessarily obscure sociotechnical processes and reify narrow cultural views, while MaSH Loops can be applied in practice without introducing its own unexamined assumptions.
What would settle it
A controlled comparison showing that static benchmarks produce value distributions matching pluralist survey data across diverse user groups without any process tracing or recursive analysis.
Figures
Original abstract
In measurement theory, instruments do not simply record reality; they help constitute what is observed. The same holds for generative AI evaluation: benchmarks do not just measure, they shape what models appear to be. Functionalist benchmarks treat models as isolated predictors, while prescriptive approaches assess what systems ought to be. Both obscure the sociotechnical processes through which meaning and values are enacted, risking the reification of narrow cultural perspectives in pluralist contexts. This thesis advances a descriptive alternative. It argues that generative AI must be evaluated as a pluralist sociotechnical system and develops Machine-Society-Human (MaSH) Loops, a framework for tracing how models, users, and institutions recursively co-construct meaning and values. Evaluation shifts from judging outputs to examining how values are enacted in interaction. Three contributions follow. Conceptually, MaSH Loops reframes evaluation as recursive, enactive process. Methodologically, the World Values Benchmark introduces a distributional approach grounded in World Values Survey data, structured prompt sets, and anchor-aware scoring. Empirically, the thesis demonstrates these through two cases: value drift in early GPT-3 and sociotechnical evaluation in real estate. A final chapter draws on participatory realism to argue that prompting and evaluation are constitutive interventions, not neutral observations. The thesis argues that static benchmarks are insufficient for generative AI. Responsible evaluation requires pluralist, process-oriented frameworks that make visible whose values are enacted. Evaluation is therefore a site of governance, shaping how AI systems are understood, deployed, and trusted.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that generative AI benchmarks do not neutrally measure model behaviors and values but constitutively shape them. It critiques functionalist (isolated predictor) and prescriptive (ought-to-be) approaches for obscuring recursive sociotechnical value enactment and risking the reification of narrow cultural perspectives. As a descriptive alternative, it advances the Machine-Society-Human (MaSH) Loops framework for tracing the co-construction of meaning among models, users, and institutions; introduces the World Values Benchmark, a distributional methodology pairing World Values Survey data with structured prompts and anchor-aware scoring; demonstrates the approach in two cases (value drift in early GPT-3 and a real-estate sociotechnical evaluation); and concludes that evaluation constitutes a site of governance.
Significance. If the MaSH Loops framework can be operationalized and shown to surface value enactments not captured by existing methods, the work would meaningfully advance conceptual foundations in AI evaluation and responsible AI governance by reframing measurement as an enactive, pluralist process rather than a neutral observation. The explicit linkage of evaluation to governance and the distributional benchmark grounded in cross-cultural survey data represent constructive contributions that could inform both academic and policy discussions on how benchmarks embed and propagate values.
Major comments (2)
- [Methodological contribution / World Values Benchmark] The central claim that MaSH Loops provides a non-circular descriptive alternative rests on the assertion that it traces rather than prescribes values, yet the World Values Benchmark methodology (described in the methodological contribution) selects specific survey items and anchor-aware scoring rules; without an explicit argument or sensitivity analysis showing these choices do not embed the very cultural perspectives the framework critiques, the distinction from prescriptive approaches remains under-supported.
- [Empirical demonstrations / case studies] The empirical demonstrations (value drift in early GPT-3 and real-estate case) are presented as illustrations of the framework, but the manuscript supplies no quantitative comparison of MaSH Loop outputs against standard static benchmarks on the same prompts or tasks; this weakens the load-bearing assertion that static benchmarks are insufficient, as readers cannot assess whether the recursive tracing reveals materially different or additional value enactments.
Minor comments (2)
- [Abstract] The abstract and introduction use dense prose when enumerating the three contributions; a bulleted list or numbered enumeration would improve scannability without altering content.
- [Final chapter] The term 'participatory realism' is invoked in the final chapter without a reference or brief definition; adding a citation or one-sentence gloss would clarify the governance argument for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which identify key opportunities to strengthen the distinction between our descriptive framework and existing approaches, as well as the empirical grounding of our claims. We address each major comment below and will incorporate revisions to improve the manuscript.
Point-by-point responses
Referee: The central claim that MaSH Loops provides a non-circular descriptive alternative rests on the assertion that it traces rather than prescribes values, yet the World Values Benchmark methodology (described in the methodological contribution) selects specific survey items and anchor-aware scoring rules; without an explicit argument or sensitivity analysis showing these choices do not embed the very cultural perspectives the framework critiques, the distinction from prescriptive approaches remains under-supported.
Authors: We acknowledge that the manuscript would be strengthened by a more explicit defense of the benchmark's descriptive character. In revision, we will add a dedicated subsection to the methodological contribution that articulates the rationale for item selection and scoring: the World Values Survey items are chosen for their established cross-cultural empirical grounding rather than normative prescription, and anchor-aware scoring is intended to surface distributional variance without imposing target values. We will also report a sensitivity analysis that varies item subsets and scoring parameters to demonstrate robustness of the identified value enactments. These additions will more clearly differentiate the approach from prescriptive methods. revision: yes
Referee: The empirical demonstrations (value drift in early GPT-3 and real-estate case) are presented as illustrations of the framework, but the manuscript supplies no quantitative comparison of MaSH Loop outputs against standard static benchmarks on the same prompts or tasks; this weakens the load-bearing assertion that static benchmarks are insufficient, as readers cannot assess whether the recursive tracing reveals materially different or additional value enactments.
Authors: The case studies were conceived as qualitative illustrations of recursive tracing rather than comparative evaluations. We agree that direct quantitative comparisons would better support the claim that static benchmarks are insufficient. In the revised manuscript, we will augment the empirical demonstrations with side-by-side analyses: applying both MaSH Loop tracing and representative static benchmarks (e.g., fixed-prompt value surveys) to identical prompt sets from the GPT-3 drift case and the real-estate evaluation. We will report quantitative differences in detected value distributions and the additional sociotechnical factors surfaced by the recursive method. revision: yes
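The side-by-side drift comparison the authors promise could be operationalized along these lines. This is a hedged sketch only: the item names, distributions, total-variation metric, and threshold are hypothetical, not the thesis's actual data or procedure.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def flag_drift(dists_v1, dists_v2, threshold=0.2):
    """Flag items whose response distributions diverge between two
    model snapshots by more than the threshold.

    dists_v1 / dists_v2 map item ids to normalised distributions
    over the same fixed option set.
    """
    drift = {item: total_variation(dists_v1[item], dists_v2[item])
             for item in dists_v1}
    return {item: d for item, d in drift.items() if d > threshold}

# Illustrative distributions for two snapshots of the same model
# (placeholder numbers, not real GPT-3 measurements).
v_early = {"importance_of_family": [0.60, 0.30, 0.10],
           "trust_in_strangers":   [0.20, 0.50, 0.30]}
v_late  = {"importance_of_family": [0.62, 0.28, 0.10],
           "trust_in_strangers":   [0.55, 0.30, 0.15]}

drifted = flag_drift(v_early, v_late)  # flags "trust_in_strangers" only
```

Running the same per-item comparison with a static benchmark's single-prompt scores alongside the loop-traced distributions would give exactly the quantitative contrast the referee asks for: which items drift, by how much, and whether the recursive method surfaces shifts the static scores miss.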
Circularity Check
No significant circularity identified
full rationale
The paper advances a conceptual reframing of generative AI evaluation as a constitutive sociotechnical process, drawing on measurement theory to critique functionalist and prescriptive benchmarks before introducing the MaSH Loops framework as a descriptive alternative for tracing recursive value enactment. No load-bearing steps reduce by construction to self-defined inputs, fitted parameters renamed as predictions, or self-citation chains; the World Values Benchmark methodology and two case studies supply independent empirical content, and the governance conclusion follows directly from the stated premises without requiring unverified uniqueness theorems or smuggled ansatzes. The derivation remains self-contained as a process-oriented proposal.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Instruments do not simply record reality; they help constitute what is observed.
- Domain assumption: Generative AI evaluation involves recursive sociotechnical co-construction of meaning and values.
Invented entities (2)
- MaSH Loops — no independent evidence
- World Values Benchmark — no independent evidence