Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
Pith reviewed 2026-05-08 18:50 UTC · model grok-4.3
The pith
This paper proposes a five-principle framework with 33 guidelines to standardize randomized controlled trials for evaluating AI effects on human performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We adopt the Shadish et al. four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification drawn from the TOP Guidelines. We then operationalize the five principles into 33 requirements with rationales and implementation steps tailored to AI evaluation RCTs. The framework centers causal inference on human performance changes, incorporates heterogeneity analysis, and directly addresses AI-specific problems including model versioning, spillover effects, and equitable impact assessment. It positions the resulting principles and guidelines as tools for study design, assessment of existing work, and future standard-setting.
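The estimand this centers on can be made concrete. A minimal sketch, assuming a two-arm RCT with a scalar human-performance score per participant; the data, group sizes, and effect size below are synthetic placeholders, not values from the paper:

```python
# Minimal sketch (synthetic data): the framework's core estimand is the
# average treatment effect (ATE) of AI assistance on a human performance
# measure, estimated in a two-arm RCT as a difference in group means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(50, 10, 200)  # task scores, participants working alone
treated = rng.normal(55, 10, 200)  # task scores, participants with AI assistance

ate = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size
             + control.var(ddof=1) / control.size)
ci = (ate - 1.96 * se, ate + 1.96 * se)                   # Wald 95% CI
t_stat, p_val = stats.ttest_ind(treated, control, equal_var=False)  # Welch test

print(f"ATE = {ate:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_val:.4f}")
```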
What carries the argument
The Shadish four-validity framework extended by a transparency-repeatability-verification principle, expressed as 33 AI-adapted guidelines.
If this is right
- Studies planned with the guidelines will produce clearer causal claims about AI effects on people rather than just model outputs.
- Existing AI evaluation papers can be scored against the guidelines to identify gaps in validity or transparency.
- The framework supplies a common blueprint that fields can use when converging on norms for AI RCTs.
- Analysis of practical significance and heterogeneity becomes a required part of reporting human impacts (a minimal sketch of such a heterogeneity analysis follows this list).
- AI-specific issues such as contamination, spillover, and model versioning receive explicit handling instructions.
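The heterogeneity requirement referenced above has a standard statistical form: a treatment-by-subgroup interaction. A minimal sketch, assuming a single prespecified binary moderator; the moderator name `expertise`, the effect sizes, and the sample size are hypothetical illustration, not content drawn from the 33 guidelines:

```python
# Minimal sketch (hypothetical moderator): heterogeneity analysis as a
# treatment-by-subgroup interaction in a linear model fit to RCT data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),      # randomized assignment to AI assistance
    "expertise": rng.integers(0, 2, n),  # 0 = novice, 1 = expert (hypothetical)
})
# Synthetic outcome: AI assistance helps novices (+6) more than experts (+2).
df["score"] = (50 + 6 * df.treat * (1 - df.expertise)
               + 2 * df.treat * df.expertise
               + 3 * df.expertise
               + rng.normal(0, 8, n))

fit = smf.ols("score ~ treat * expertise", data=df).fit()
# The treat:expertise coefficient is the heterogeneity (subgroup) effect.
print(fit.summary().tables[1])
```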
Where Pith is reading between the lines
- Widespread use could enable reliable meta-analyses across different AI systems by making human-impact results more comparable.
- Regulators or deployers might begin requiring evidence from guideline-aligned RCTs before approving high-stakes AI applications.
- The guidelines could be tested by re-running a set of published AI studies under the new rules and checking for changed conclusions.
- Future work might extend the same structure to non-RCT AI evaluations such as observational studies.
Load-bearing premise
That validity and transparency practices from medicine, psychology, and economics can be applied directly to AI without needing substantial new checks for its distinctive features such as rapid model changes and data contamination.
What would settle it
A head-to-head comparison of AI evaluation studies that follow versus ignore the 33 guidelines, measuring whether the guideline-compliant studies show higher rates of successful replication on measures of human performance change.
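One concrete form such a comparison could take: treat each re-run study as a success/failure trial and compare replication proportions between the two groups. A minimal sketch using statsmodels; the counts below are invented placeholders, not results of any actual comparison:

```python
# Minimal sketch (invented counts): compare replication rates between
# guideline-compliant and non-compliant studies with a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

replicated = [34, 21]  # successful replications: [compliant, non-compliant]
total = [50, 50]       # studies re-run per group (placeholder sample sizes)

z, p = proportions_ztest(replicated, total, alternative="larger")
lo1, hi1 = proportion_confint(replicated[0], total[0], method="wilson")
lo2, hi2 = proportion_confint(replicated[1], total[1], method="wilson")

print(f"compliant: {replicated[0] / total[0]:.0%} (95% CI {lo1:.0%}-{hi1:.0%})")
print(f"non-compliant: {replicated[1] / total[1]:.0%} (95% CI {lo2:.0%}-{hi2:.0%})")
print(f"one-sided z = {z:.2f}, p = {p:.4f}")
```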
Original abstract
This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the Shadish et al. (2002) four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish a foundational framework for standardizing AI evaluation RCTs. It adopts the Shadish et al. (2002) four-validity framework, extends it with a fifth principle on transparency, repeatability, and verification adapted from the TOP Guidelines, and operationalizes the result into 33 AI-specific guidelines with rationales, implementation instructions, and evidence bases. These are positioned as a design tool for planning studies, an evaluation rubric for existing work, and a blueprint for standard setting. The framework addresses AI challenges such as model versioning, contamination, spillover effects, human-AI interaction dynamics, heterogeneity analysis, practical significance, and equitable impact assessment, with a focus on human performance rather than model outputs alone.
Significance. If adopted, the framework could help standardize causal inference and validity assessment in AI evaluations, drawing usefully from established RCT traditions in psychology, clinical sciences, and software engineering. The explicit synthesis, human-centered focus, and operationalization into actionable guidelines represent a constructive contribution to a field where evaluations often lack rigor. Strengths include the graded transparency framework and attention to practical issues like contamination; however, as a proposal without empirical testing of the guidelines, its long-term impact depends on community uptake and refinement.
Major comments (3)
- [section on AI-specific challenges and extensions] The central claim of establishing the framework rests on direct adaptation of the Shadish four-validity structure and TOP Guidelines; however, the manuscript provides only high-level discussion of AI-specific adaptations (e.g., for model versioning and spillover) without a systematic analysis of where traditional RCT assumptions break down or require new empirical support in AI settings.
- [section operationalizing the principles into guidelines] Operationalization into exactly 33 guidelines is presented as comprehensive, but without an explicit mapping table or justification showing how each guideline distributes across the five principles (and avoids gaps or redundancy), it is difficult to verify that all aspects of internal, external, construct, statistical conclusion validity plus transparency are fully covered.
- [discussion of roles as design tool, rubric, and blueprint] The positioning of the framework as an 'evaluation rubric' for existing work assumes the guidelines are sufficiently precise for scoring or assessment, yet the manuscript does not include even a single worked example applying the full set to a published AI RCT, which would be needed to substantiate this use case.
Minor comments (3)
- [Abstract] The abstract is lengthy and repetitive in listing extensions; condensing it while retaining the core claim would improve accessibility.
- Consider adding a summary table listing all 33 guidelines by principle, with brief implementation notes, to serve as a quick reference for readers using the paper as a design tool.
- [introduction and framework sections] Some citations to Shadish et al. (2002) and the TOP Guidelines could be expanded with specific page or section references when adapting particular validity types to AI contexts.
Simulated Author's Rebuttal
We thank the referee for their constructive review and recommendation for minor revision. We address each major comment point by point below, indicating in each case whether and how the requested change will be incorporated.
Point-by-point responses
- Referee: The central claim of establishing the framework rests on direct adaptation of the Shadish four-validity structure and TOP Guidelines; however, the manuscript provides only high-level discussion of AI-specific adaptations (e.g., for model versioning and spillover) without a systematic analysis of where traditional RCT assumptions break down or require new empirical support in AI settings.
Authors: We appreciate this observation. The manuscript's core contribution is the direct adaptation of the Shadish et al. (2002) framework extended by TOP Guidelines, with AI-specific challenges (model versioning, spillover, contamination, human-AI dynamics) addressed through the operationalized guidelines rather than a standalone meta-analysis of assumption breakdowns. A systematic empirical examination of where traditional RCT assumptions fail in AI would require new data collection or large-scale review, which is outside the scope of this foundational proposal. In revision we will expand the discussion section to more explicitly flag potential assumption violations with supporting citations from existing AI evaluation literature. revision: partial
- Referee: Operationalization into exactly 33 guidelines is presented as comprehensive, but without an explicit mapping table or justification showing how each guideline distributes across the five principles (and avoids gaps or redundancy), it is difficult to verify that all aspects of internal, external, construct, statistical conclusion validity plus transparency are fully covered.
Authors: We agree that an explicit mapping would strengthen verifiability. The 33 guidelines were developed by assigning each to the most relevant principle(s) while cross-checking for coverage and minimizing overlap, but this process is currently described only narratively. In the revised manuscript we will insert a mapping table that lists each guideline against the five principles, with brief notes on rationale and redundancy avoidance. revision: yes
- Referee: The positioning of the framework as an 'evaluation rubric' for existing work assumes the guidelines are sufficiently precise for scoring or assessment, yet the manuscript does not include even a single worked example applying the full set to a published AI RCT, which would be needed to substantiate this use case.
Authors: This is a valid concern for the rubric use case. While the guidelines include implementation instructions intended to support assessment, a concrete example would make the claim more tangible. Adding a full scoring of an existing published RCT would substantially increase length and risk copyright or selection-bias issues. In revision we will add a concise illustrative walkthrough applying the guidelines to a synthetic but realistic AI RCT scenario, demonstrating rubric-style assessment without claiming exhaustive coverage of any single published study. revision: partial
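One way such a rubric-style walkthrough could be mechanized is as a graded scoring pass over guideline entries, aggregated by principle. A minimal sketch; every principle name, guideline description, and grade below is a hypothetical placeholder rather than content from the paper's actual 33 guidelines:

```python
# Hypothetical sketch: rubric-style scoring of a study against graded
# guidelines, aggregated by principle. All names and grades are placeholders;
# the paper's actual 33 guidelines are not reproduced here.
from collections import defaultdict

GRADES = {"absent": 0, "partial": 1, "full": 2}

# (principle, guideline, assigned grade) -- placeholder content throughout
assessment = [
    ("internal validity", "randomization procedure reported", "full"),
    ("internal validity", "model version pinned and disclosed", "partial"),
    ("external validity", "participant population characterized", "full"),
    ("transparency", "preregistration or protocol linked", "absent"),
]

by_principle = defaultdict(list)
for principle, guideline, grade in assessment:
    by_principle[principle].append(GRADES[grade])

for principle, scores in by_principle.items():
    print(f"{principle}: {sum(scores)}/{2 * len(scores)} points")
```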
Circularity Check
No significant circularity
Full rationale
The paper is explicitly a synthesis and proposal that adopts the external Shadish et al. (2002) four-validity framework and TOP Guidelines (Center for Open Science) with direct attribution, then operationalizes them into 33 AI-adapted guidelines. No equations, fitted parameters, self-citations as load-bearing premises, or reductions of new claims to the authors' own prior outputs appear. The derivation chain consists of adaptation and formalization steps grounded in external sources rather than in the paper's own prior claims.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The Shadish et al. (2002) four-validity framework applies without major modification to AI evaluation RCTs.
- Domain assumption: The TOP Guidelines on transparency can be extended to AI RCT contexts.
Reference graph
Works this paper leans on
- [1] Nosek, B. A. and others. Promoting an open research culture. Science, 2015.
- [2] Zhao, Y. and others. Meta-analysis of …
- [3] Ralph, P. and others. Empirical Standards for Software Engineering Research. 2020.
- [4] 2024.
- [5] Weidinger, L., Raji, I. D., Wallach, H., Mitchell, M., Wang, A., Salaudeen, O. and others. Toward an evaluation science for generative AI systems. 2025.
- [6] Uplift modeling with continuous treatments: A predict-then-optimize approach. arXiv preprint arXiv:2412.09232, 2024.
- [7] Gutierrez, P. and Gérardy, J.-Y. Causal inference and uplift modelling: A review of the literature. International Conference on Predictive Applications and APIs, 2017.
- [8] Uplift vs. predictive modeling: A theoretical analysis. arXiv preprint arXiv:2309.12036, 2023.
- [9] Uplift modeling. In Encyclopedia of Machine Learning and Data Mining. Springer, 2017.
- [10] Paskov, P., Byun, M., Wei, K. and Webster, T. Preliminary suggestions for rigorous … Technical report.
- [11] Mishra, S., Clark, J. and Perrault, C. R. Measurement in AI policy: Opportunities and challenges. 2020.
- [12] Friedland, A.
- [13] Talmon, J., Ammenwerth, E., Brender, J., De Keizer, N., Nykanen, P. and Rigby, M. STARE-HI: Statement on reporting of evaluation studies in health informatics. International Journal of Medical Informatics, 2009.
- [14] Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. and Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nature Medicine, 2020.
- [15] Cruz Rivera, S., Liu, X., Chan, A.-W., Denniston, A. K. and Calvert, M. J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI extension. Nature Medicine, 2020.
- [16] McCaslin, R. and others.
- [17] Kashani, S. and others. Reporting quality in …
- [18] Hatherley, J. Limits of trust in medical AI. 2022.
- [19] Shadish, W. R., Cook, T. D. and Campbell, D. T. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton Mifflin, 2002.
- [20] Cronbach, L. J. and Meehl, P. E. Construct validity in psychological tests. Psychological Bulletin, 1955.
- [21] Campbell, D. T. and Stanley, J. C. Experimental and Quasi-Experimental Designs for Research. Rand McNally, 1963.
- [22] A simplified guide to randomized controlled trials. Acta Obstetricia et Gynecologica Scandinavica, 2018.
- [23] Mouton, C., Lucas, C. and Guest, E. The operational risks of AI in large-scale biological attacks. RAND Corporation.
- [24] Paskov, P., Hong, S. Z. and others. 2026.
- [25] Investigating the analytical robustness of the social and behavioural sciences. Nature.
- [26] Investigating the replicability of the social and behavioural sciences. Nature.
- [27] Investigating the reproducibility of the social and behavioural sciences. Nature.
- [28] Reproducibility and robustness of economics and political science research. Nature.
- [29] A brief glossary of terms about repeatability: Replicability, robustness, and reproducibility. 2026.
- [30] A framework for assessing the trustworthiness of scientific research findings. Proceedings of the National Academy of Sciences, 2026.
- [31] Center for Open Science. Transparency and Openness Promotion (TOP) Guidelines. 2025.
- [32] Anthropic. Claude 3.7 Sonnet System Card. 2025.
- [33] Anderljung, M., Barnhart, J., Korinek, A., Leung, J., O'Keefe, C., Whittlestone, J. and others. Frontier AI regulation: Managing emerging risks to public safety. arXiv preprint arXiv:2307.03718, 2023.
- [34] Wijk, H. and others. Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models. arXiv preprint arXiv:2411.15114.
- [35] Buhl, M. D., Bucknall, B. and Masterson, T. Emerging practices in frontier AI safety frameworks. arXiv preprint arXiv:2503.04746.
- [36] Bommasani, R. and others. On the Opportunities and Risks of Foundation Models. arXiv preprint arXiv:2108.07258, 2021.
- [37] Zhao, W. X. and others. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223, 2023.
- [38]
- [39] Schulz, K. F., Altman, D. G. and Moher, D. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. 2010.
- [40] Chan, A.-W., Tetzlaff, J. M., Altman, D. G., Laupacis, A., Gøtzsche, P. C. and others. SPIRIT 2013 statement: Defining standard protocol items for clinical trials. Annals of Internal Medicine, 2013.