Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
Pith reviewed 2026-05-08 18:50 UTC · model grok-4.3
The pith
This paper proposes a five-principle framework with 33 guidelines to standardize randomized controlled trials for evaluating AI effects on human performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We adopt the Shadish et al. four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification drawn from the TOP Guidelines. We then operationalize the five principles into 33 requirements with rationales and implementation steps tailored to AI evaluation RCTs. The framework centers causal inference on human performance changes, incorporates heterogeneity analysis, and directly addresses AI-specific problems including model versioning, spillover effects, and equitable impact assessment. It positions the resulting principles and guidelines as tools for study design, assessment of existing work, and future standard-setting.
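The estimand this centers on can be made concrete. A minimal sketch, assuming a two-arm RCT with a scalar human-performance score per participant; the data, group sizes, and effect size below are synthetic placeholders, not values from the paper:

```python
# Minimal sketch (synthetic data): the framework's core estimand is the
# average treatment effect (ATE) of AI assistance on a human performance
# measure, estimated in a two-arm RCT as a difference in group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(50, 10, 200)  # task scores, participants working alone
treated = rng.normal(55, 10, 200)  # task scores, participants with AI assistance

ate = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size
             + control.var(ddof=1) / control.size)
ci = (ate - 1.96 * se, ate + 1.96 * se)                   # Wald 95% CI
t_stat, p_val = stats.ttest_ind(treated, control, equal_var=False)  # Welch test

print(f"ATE = {ate:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_val:.4f}")
```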
What carries the argument
The Shadish four-validity framework extended by a transparency-repeatability-verification principle, expressed as 33 AI-adapted guidelines.
If this is right
- Studies planned with the guidelines will produce clearer causal claims about AI effects on people rather than just model outputs.
- Existing AI evaluation papers can be scored against the guidelines to identify gaps in validity or transparency.
- The framework supplies a common blueprint that fields can use when converging on norms for AI RCTs.
- Analysis of practical significance and heterogeneity becomes a required part of reporting human impacts (a minimal sketch of such a heterogeneity analysis follows this list).
- AI-specific issues such as contamination, spillover, and model versioning receive explicit handling instructions.
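The heterogeneity requirement referenced above has a standard statistical form: a treatment-by-subgroup interaction. A minimal sketch, assuming a single prespecified binary moderator; the moderator name `expertise`, the effect sizes, and the sample size are hypothetical illustration, not content drawn from the 33 guidelines:

```python
# Minimal sketch (hypothetical moderator): heterogeneity analysis as a
# treatment-by-subgroup interaction in a linear model fit to RCT data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),      # randomized assignment to AI assistance
    "expertise": rng.integers(0, 2, n),  # 0 = novice, 1 = expert (hypothetical)
})
# Synthetic outcome: AI assistance helps novices (+6) more than experts (+2).
df["score"] = (50 + 6 * df.treat * (1 - df.expertise)
               + 2 * df.treat * df.expertise
               + 3 * df.expertise
               + rng.normal(0, 8, n))

fit = smf.ols("score ~ treat * expertise", data=df).fit()
# The treat:expertise coefficient is the heterogeneity (subgroup) effect.
print(fit.summary().tables[1])
```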
Where Pith is reading between the lines
- Widespread use could enable reliable meta-analyses across different AI systems by making human-impact results more comparable.
- Regulators or deployers might begin requiring evidence from guideline-aligned RCTs before approving high-stakes AI applications.
- The guidelines could be tested by re-running a set of published AI studies under the new rules and checking for changed conclusions.
- Future work might extend the same structure to non-RCT AI evaluations such as observational studies.
Load-bearing premise
That validity and transparency practices from medicine, psychology, and economics can be applied directly to AI without needing substantial new checks for its distinctive features such as rapid model changes and data contamination.
What would settle it
A head-to-head comparison of AI evaluation studies that follow versus ignore the 33 guidelines, measuring whether the guideline-compliant studies show higher rates of successful replication on measures of human performance change.
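One concrete form such a comparison could take: treat each re-run study as a success/failure trial and compare replication proportions between the two groups. A minimal sketch using statsmodels; the counts below are invented placeholders, not results of any actual comparison:

```python
# Minimal sketch (invented counts): compare replication rates between
# guideline-compliant and non-compliant studies with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

replicated = [34, 21]  # successful replications: [compliant, non-compliant]
total = [50, 50]       # studies re-run per group (placeholder sample sizes)

z, p = proportions_ztest(replicated, total, alternative="larger")
lo1, hi1 = proportion_confint(replicated[0], total[0], method="wilson")
lo2, hi2 = proportion_confint(replicated[1], total[1], method="wilson")

print(f"compliant: {replicated[0] / total[0]:.0%} (95% CI {lo1:.0%}-{hi1:.0%})")
print(f"non-compliant: {replicated[1] / total[1]:.0%} (95% CI {lo2:.0%}-{hi2:.0%})")
print(f"one-sided z = {z:.2f}, p = {p:.4f}")
```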
Original abstract
This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the Shadish et al. (2002) four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish a foundational framework for standardizing AI evaluation RCTs. It adopts the Shadish et al. (2002) four-validity framework, extends it with a fifth principle on transparency, repeatability, and verification adapted from the TOP Guidelines, and operationalizes the result into 33 AI-specific guidelines with rationales, implementation instructions, and evidence bases. These are positioned as a design tool for planning studies, an evaluation rubric for existing work, and a blueprint for standard setting. The framework addresses AI challenges such as model versioning, contamination, spillover effects, human-AI interaction dynamics, heterogeneity analysis, practical significance, and equitable impact assessment, with a focus on human performance rather than model outputs alone.
Significance. If adopted, the framework could help standardize causal inference and validity assessment in AI evaluations, drawing usefully from established RCT traditions in psychology, clinical sciences, and software engineering. The explicit synthesis, human-centered focus, and operationalization into actionable guidelines represent a constructive contribution to a field where evaluations often lack rigor. Strengths include the graded transparency framework and attention to practical issues like contamination; however, as a proposal without empirical testing of the guidelines, its long-term impact depends on community uptake and refinement.
Major comments (3)
- [section on AI-specific challenges and extensions] The central claim of establishing the framework rests on direct adaptation of the Shadish four-validity structure and TOP Guidelines; however, the manuscript provides only high-level discussion of AI-specific adaptations (e.g., for model versioning and spillover) without a systematic analysis of where traditional RCT assumptions break down or require new empirical support in AI settings.
- [section operationalizing the principles into guidelines] Operationalization into exactly 33 guidelines is presented as comprehensive, but without an explicit mapping table or justification showing how each guideline distributes across the five principles (and avoids gaps or redundancy), it is difficult to verify that all aspects of internal, external, construct, statistical conclusion validity plus transparency are fully covered.
- [discussion of roles as design tool, rubric, and blueprint] The positioning of the framework as an 'evaluation rubric' for existing work assumes the guidelines are sufficiently precise for scoring or assessment, yet the manuscript does not include even a single worked example applying the full set to a published AI RCT, which would be needed to substantiate this use case.
Minor comments (3)
- [Abstract] The abstract is lengthy and repetitive in listing extensions; condensing it while retaining the core claim would improve accessibility.
- Consider adding a summary table listing all 33 guidelines by principle, with brief implementation notes, to serve as a quick reference for readers using the paper as a design tool.
- [introduction and framework sections] Some citations to Shadish et al. (2002) and the TOP Guidelines could be expanded with specific page or section references when adapting particular validity types to AI contexts.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We address each major comment point by point below, indicating in each case whether and how the requested change will be incorporated.
Point-by-point responses
- Referee: The central claim of establishing the framework rests on direct adaptation of the Shadish four-validity structure and TOP Guidelines; however, the manuscript provides only high-level discussion of AI-specific adaptations (e.g., for model versioning and spillover) without a systematic analysis of where traditional RCT assumptions break down or require new empirical support in AI settings.
Authors: We appreciate this observation. The manuscript's core contribution is the direct adaptation of the Shadish et al. (2002) framework extended by TOP Guidelines, with AI-specific challenges (model versioning, spillover, contamination, human-AI dynamics) addressed through the operationalized guidelines rather than a standalone meta-analysis of assumption breakdowns. A systematic empirical examination of where traditional RCT assumptions fail in AI would require new data collection or large-scale review, which is outside the scope of this foundational proposal. In revision we will expand the discussion section to more explicitly flag potential assumption violations with supporting citations from existing AI evaluation literature. revision: partial
- Referee: Operationalization into exactly 33 guidelines is presented as comprehensive, but without an explicit mapping table or justification showing how each guideline distributes across the five principles (and avoids gaps or redundancy), it is difficult to verify that all aspects of internal, external, construct, statistical conclusion validity plus transparency are fully covered.
Authors: We agree that an explicit mapping would strengthen verifiability. The 33 guidelines were developed by assigning each to the most relevant principle(s) while cross-checking for coverage and minimizing overlap, but this process is currently described only narratively. In the revised manuscript we will insert a mapping table that lists each guideline against the five principles, with brief notes on rationale and redundancy avoidance. revision: yes
- Referee: The positioning of the framework as an 'evaluation rubric' for existing work assumes the guidelines are sufficiently precise for scoring or assessment, yet the manuscript does not include even a single worked example applying the full set to a published AI RCT, which would be needed to substantiate this use case.
Authors: This is a valid concern for the rubric use case. While the guidelines include implementation instructions intended to support assessment, a concrete example would make the claim more tangible. Adding a full scoring of an existing published RCT would substantially increase length and risk copyright or selection-bias issues. In revision we will add a concise illustrative walkthrough applying the guidelines to a synthetic but realistic AI RCT scenario, demonstrating rubric-style assessment without claiming exhaustive coverage of any single published study. revision: partial
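One way such a rubric-style walkthrough could be mechanized is as a graded scoring pass over guideline entries, aggregated by principle. A minimal sketch; every principle name, guideline description, and grade below is a hypothetical placeholder rather than content from the paper's actual 33 guidelines:

```python
# Hypothetical sketch: rubric-style scoring of a study against graded
# guidelines, aggregated by principle. All names and grades are placeholders;
# the paper's actual 33 guidelines are not reproduced here.
from collections import defaultdict

GRADES = {"absent": 0, "partial": 1, "full": 2}

# (principle, guideline, assigned grade) -- placeholder content throughout
assessment = [
    ("internal validity", "randomization procedure reported", "full"),
    ("internal validity", "model version pinned and disclosed", "partial"),
    ("external validity", "participant population characterized", "full"),
    ("transparency", "preregistration or protocol linked", "absent"),
]

by_principle = defaultdict(list)
for principle, guideline, grade in assessment:
    by_principle[principle].append(GRADES[grade])

for principle, scores in by_principle.items():
    print(f"{principle}: {sum(scores)}/{2 * len(scores)} points")
```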
Circularity Check
No significant circularity
Full rationale
The paper is explicitly a synthesis and proposal that adopts the external Shadish et al. (2002) four-validity framework and TOP Guidelines (Center for Open Science) with direct attribution, then operationalizes them into 33 AI-adapted guidelines. No equations, fitted parameters, self-citations as load-bearing premises, or reductions of new claims to the authors' own prior outputs appear. The derivation chain consists of adaptation and formalization steps grounded in external sources rather than in the paper's own prior claims.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The Shadish et al. (2002) four-validity framework applies without major modification to AI evaluation RCTs.
- Domain assumption: The TOP Guidelines on transparency can be extended to AI RCT contexts.
Reference graph
Works this paper leans on
- [1] Nosek, B. A. and others. Promoting an open research culture. Science, 2015.
- [2] Zhao, Y. and others. Meta-analysis of …
- [3] Ralph, P. and others. Empirical Standards for Software Engineering Research. 2020.
- [4] 2024.
- [5] Weidinger, L., Raji, I. D., Wallach, H., Mitchell, M., Wang, A., Salaudeen, O. and others. Toward an evaluation science for generative AI systems. 2025.
- [6] Uplift modeling with continuous treatments: A predict-then-optimize approach. arXiv preprint arXiv:2412.09232, 2024.
- [7] Gutierrez, P. and Gérardy, J.-Y. Causal inference and uplift modelling: A review of the literature. International Conference on Predictive Applications and APIs, 2017.
- [8] Uplift vs. predictive modeling: A theoretical analysis. arXiv preprint arXiv:2309.12036, 2023.
- [9] Uplift modeling. In Encyclopedia of Machine Learning and Data Mining. Springer, 2017.
- [10] Paskov, P., Byun, M., Wei, K. and Webster, T. Preliminary suggestions for rigorous … Technical report.
- [11] Mishra, S., Clark, J. and Perrault, C. R. Measurement in AI policy: Opportunities and challenges. 2020.
- [12] Friedland, A.
- [13] Talmon, J., Ammenwerth, E., Brender, J., De Keizer, N., Nykanen, P. and Rigby, M. STARE-HI: Statement on reporting of evaluation studies in health informatics. International Journal of Medical Informatics, 2009.
- [14] Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. and Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 2020.
- [15] Cruz Rivera, S., Liu, X., Chan, A.-W., Denniston, A. K. and Calvert, M. J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine, 2020.
- [16] McCaslin, R. and others.
- [17] Kashani, S. and others. Reporting quality in …
- [18] Hatherley, J. Limits of trust in medical AI. 2022.
- [19] Shadish, W. R., Cook, T. D. and Campbell, D. T. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, 2002.
- [20] Cronbach, L. J. and Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955.
- [21] Campbell, D. T. and Stanley, J. C. Experimental and Quasi-Experimental Designs for Research. Rand McNally, 1963.
- [22] A simplified guide to randomized controlled trials. Acta Obstetricia et Gynecologica Scandinavica, 2018.
- [23] Mouton, C., Lucas, C. and Guest, E. The operational risks of AI in large-scale biological attacks. RAND Corporation.
- [24] Paskov, P., Hong, S. Z. and others. 2026.
- [25] Investigating the analytical robustness of the social and behavioural sciences. Nature.
- [26] Investigating the replicability of the social and behavioural sciences. Nature.
- [27] Investigating the reproducibility of the social and behavioural sciences. Nature.
- [28] Reproducibility and robustness of economics and political science research. Nature.
- [29] A brief glossary of terms about repeatability: Replicability, robustness, and reproducibility. 2026.
- [30] A framework for assessing the trustworthiness of scientific research findings. Proceedings of the National Academy of Sciences, 2026.
- [31] Center for Open Science. Transparency and Openness Promotion (TOP) Guidelines. 2025.
- [32] Anthropic. Claude 3.7 Sonnet System Card. 2025.
- [33] Anderljung, M., Barnhart, J., Korinek, A., Leung, J., O'Keefe, C., Whittlestone, J. and others. Frontier AI regulation: Managing emerging risks to public safety. arXiv preprint arXiv:2307.03718, 2023.
- [34] Wijk, H. and others. Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models. arXiv preprint arXiv:2411.15114.
- [35] Buhl, M. D., Bucknall, B. and Masterson, T. Emerging practices in frontier AI safety frameworks. arXiv preprint arXiv:2503.04746.
- [36] Bommasani, R. and others. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258, 2021.
- [37] Zhao, W. X. and others. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223, 2023.
- [38]
- [39] Schulz, K. F., Altman, D. G. and Moher, D. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. 2010.
- [40] Chan, A.-W., Tetzlaff, J. M., Altman, D. G., Laupacis, A., Gøtzsche, P. C. and others. SPIRIT 2013 statement: Defining standard protocol items for clinical trials. Annals of Internal Medicine, 2013.