pith. machine review for the scientific record.

arxiv: 2605.12832 · v1 · submitted 2026-05-12 · 📊 stat.AP · cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Digital Twins as Synthetic Controls in Single-Arm Trials

Aaron M. Smith, Daniele Bertolini, Franklin Fuller, Jonathan R. Walsh, Run Zhuang

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:16 UTC · model grok-4.3

classification 📊 stat.AP · cs.LG · stat.ML
keywords digital twins · synthetic controls · single-arm trials · machine learning · clinical trial design · disease progression · doubly robust estimators · real-world evidence

The pith

Digital twins from machine learning models can serve as synthetic controls in single-arm clinical trials

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that personalized disease-progression predictions generated by machine learning models trained on historical data, known as digital twins, can function as effective synthetic control arms for single-arm trials. This approach uses flexible outcome modeling to produce more robust estimates of treatment effects than simpler matching methods, especially when new patients differ from those in past datasets. The framework includes doubly robust estimation techniques, power and sample-size formulas, and guidance on selecting historical data, and it addresses regulatory considerations for AI use in drug development. The methods are illustrated by reanalyzing data from amyotrophic lateral sclerosis and Huntington's disease trials. A sympathetic reader would care because single-arm trials are common for ethical and practical reasons but need strong comparators to support reliable conclusions about new treatments.

Core claim

Outcome-model-based synthetic control arms are an important tool for single-arm trials. Digital twins, which are personalized predictions of disease progression generated from machine learning models trained on historical datasets, naturally leverage these flexible approaches to yield more robust estimates of treatment effects and provide a principled way to incorporate corrections when external data are not directly comparable.
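As a minimal sketch of what this claim amounts to (notation assumed here, not taken from the paper): the outcome-model-based estimate subtracts each treated participant's digital-twin prediction from their observed outcome and averages the differences.

```latex
% Outcome-model-based effect estimate in a single-arm trial (sketch).
% Assumed notation: \hat{\mu}_0 is the digital-twin model trained on
% historical control data, X_i baseline covariates, Y_i observed outcomes,
% and n_1 the number of treated participants.
\hat{\tau} = \frac{1}{n_1} \sum_{i=1}^{n_1} \left( Y_i - \hat{\mu}_0(X_i) \right)
```

If \hat{\mu}_0 transports without bias to the trial population, \hat{\tau} targets the average effect on the treated; the "principled corrections" in the claim are what the paper offers when that transportability fails.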

What carries the argument

Digital twins: personalized predictions of disease progression from machine learning models trained on historical datasets, serving as outcome-model-based synthetic controls
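To make the mechanism concrete, here is a minimal end-to-end sketch on synthetic data (model choice, features, covariate shift, and effect size are all hypothetical, not the paper's):

```python
# Sketch of the digital-twin workflow: train an outcome model on historical
# controls, predict untreated progression for trial participants, and average
# observed-minus-predicted differences. Illustrative data and model only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
beta = np.array([1.0, -0.5, 0.3, 0.0])

# Historical (untreated) patients: baseline covariates and observed progression.
X_hist = rng.normal(size=(500, 4))
y_hist = X_hist @ beta + rng.normal(size=500)

# The "digital twin generator" is just the fitted outcome model.
twin_model = GradientBoostingRegressor().fit(X_hist, y_hist)

# Single-arm trial: treated patients only, with a true effect of -0.8
# and a mild covariate shift relative to the historical cohort.
X_trial = rng.normal(loc=0.2, size=(100, 4))
y_trial = X_trial @ beta - 0.8 + rng.normal(size=100)

# Each participant's digital twin is their predicted untreated outcome.
twins = twin_model.predict(X_trial)

# Outcome-model-based effect estimate.
tau_hat = float(np.mean(y_trial - twins))
print(f"estimated treatment effect: {tau_hat:.2f}  (true: -0.80)")
```

The estimate is only as good as the twin model's transportability, which is exactly the load-bearing premise flagged next.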

Load-bearing premise

Machine learning models trained on historical datasets produce accurate and unbiased predictions of disease progression for patients in the current single-arm trial even when populations differ in unmeasured ways

What would settle it

A randomized controlled trial of the same intervention showing a treatment effect estimate that differs substantially from the one derived using digital twin synthetic controls

Figures

Figures reproduced from arXiv: 2605.12832 by Aaron M. Smith, Daniele Bertolini, Franklin Fuller, Jonathan R. Walsh, Run Zhuang.

Figure 1. Decision flow for selecting an external comparator strategy in single-arm trials, summarizing …
Figure 2. Ratio of the single-arm trial sample size (treated participants only) to the total sample …
Figure 3. Outcome model influence is lowest when highly relevant historical data and a well …
Figure 4. Celebrex analysis. Top: control arm as a single-arm study vs. external control. Left: abso…
Figure 5. 2CARE analysis. Top: control arm as a single-arm study vs. external control. Panels and …
Figure 6. Overlap-resampling sweep for the ALS Celebrex analysis. Left: mean absolute standardized …
Figure 7. Overlap-resampling sweep for the HD 2CARE analysis. Panels and conventions as in Fig. …
Original abstract

Single-arm trials are an important study design for evaluating drug efficacy and safety without enrolling patients into a control arm. Although they do not provide the gold-standard evidence of randomized controlled trials, they are increasingly used in clinical development as they offer an efficient, ethical, and practical alternative. A wide variety of approaches can be used to construct control comparators and estimate treatment effects, from fixed comparators informed by clinical knowledge to data-based and model-based patient-level comparators, also known as synthetic controls. Powerful and flexible machine learning models can allow outcome-model-based synthetic controls to overcome key limitations of direct data-based approaches, yield more robust estimates of treatment effects, and provide a principled way to incorporate corrections or encode additional assumptions when external data are not directly comparable. In this work, we argue that outcome-model-based synthetic control arms are an important tool for single-arm trials. We focus on digital twins, personalized predictions of disease progression generated from machine learning models trained on historical datasets, which naturally leverage these flexible approaches. We review doubly robust estimators, present power and sample size formulas, and discuss trade-offs in selecting historical data for training and analysis. We also outline practical considerations for deploying digital twins within the framework of recent FDA draft guidance on the use of artificial intelligence in drug development. Finally, we reanalyze data from trials in amyotrophic lateral sclerosis and Huntington's disease to demonstrate the proposed methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that outcome-model-based synthetic controls using digital twins—personalized ML predictions of disease progression trained on historical datasets—offer a flexible and robust approach for estimating treatment effects in single-arm trials. It reviews doubly robust estimators, presents power and sample size formulas, discusses trade-offs in selecting historical training data, outlines practical considerations for alignment with FDA draft guidance on AI in drug development, and demonstrates the methods via reanalyses of amyotrophic lateral sclerosis and Huntington's disease trial data.

Significance. If the core assumptions hold, the work provides a timely framework for improving rigor in single-arm trials, which are common in rare-disease settings where RCTs are impractical. The integration of flexible ML outcome models with doubly robust estimation, combined with power formulas and regulatory alignment, could support more efficient trial design and analysis. The reanalyses illustrate feasibility on real neurodegenerative data, and the emphasis on handling non-comparable external data is a practical strength.

major comments (3)
  1. [Section on doubly robust estimators] The manuscript references doubly robust estimators but provides no explicit mathematical formulation (e.g., the precise form of the augmentation term combining the digital-twin outcome model with any weighting or propensity component) or derivation of consistency under distribution shift. Without this, it is difficult to verify the conditions under which double robustness protects against misspecification when the ML model is trained on historical data that may differ from the trial population in unmeasured prognostic factors.
  2. [Reanalysis sections] In the reanalysis sections for ALS and Huntington's data, the manuscript does not report model training details (feature engineering, hyperparameter selection, cross-validation strategy), predictive performance metrics on held-out historical data, or sensitivity analyses for covariate or outcome shifts between historical and trial cohorts. These omissions limit assessment of whether the reported treatment-effect estimates remain reliable when the digital-twin predictions are transported to the current trial population.
  3. [Power and sample size formulas] The power and sample size formulas are presented without accompanying derivation, simulation studies, or empirical validation showing type-I error control and coverage under realistic ML model misspecification or distribution shift scenarios. This weakens the practical utility of the formulas for trial planning.
minor comments (2)
  1. [Abstract] The abstract states that the methods are demonstrated on ALS and Huntington's data but does not summarize the key numerical findings (e.g., estimated treatment effects or confidence intervals), which would help readers quickly gauge the magnitude of the results.
  2. [Notation and methods] Notation for the digital-twin predictions and the doubly robust estimator is introduced without a dedicated notation table or consistent symbol definitions across sections, making some equations harder to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have identified important opportunities to strengthen the clarity and rigor of our manuscript. We address each major comment below and will revise the paper accordingly.

Point-by-point responses
  1. Referee: [Section on doubly robust estimators] The manuscript references doubly robust estimators but provides no explicit mathematical formulation (e.g., the precise form of the augmentation term combining the digital-twin outcome model with any weighting or propensity component) or derivation of consistency under distribution shift. Without this, it is difficult to verify the conditions under which double robustness protects against misspecification when the ML model is trained on historical data that may differ from the trial population in unmeasured prognostic factors.

    Authors: We agree that an explicit formulation and derivation will improve verifiability. In the revised manuscript we will add the precise doubly robust estimator expression (augmented inverse-probability-weighted form that combines the digital-twin outcome predictions with a propensity-based correction term) together with a short derivation of its consistency under distribution shift between historical training data and the trial population, conditional on correct specification of either the outcome model or the propensity model. revision: yes
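    For reference, a standard augmented inverse-probability-weighted form of the kind described here, sketched with assumed notation (the authors' exact estimator may differ):

```latex
% Doubly robust (AIPW-style) estimate of the effect on the treated.
% \hat{\mu}_0: digital-twin outcome model; \hat{e}: propensity model;
% A_i = 1 for trial participants, 0 for external controls; n_1 = number treated.
\hat{\tau}_{\mathrm{DR}}
  = \frac{1}{n_1} \sum_{i:\,A_i=1} \left( Y_i - \hat{\mu}_0(X_i) \right)
  - \frac{1}{n_1} \sum_{i:\,A_i=0} \frac{\hat{e}(X_i)}{1-\hat{e}(X_i)}
    \left( Y_i - \hat{\mu}_0(X_i) \right)
```

    Consistency holds when either \hat{\mu}_0 or \hat{e} is correctly specified, which is the protection the referee asks to see stated precisely under distribution shift.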

  2. Referee: [Reanalysis sections] In the reanalysis sections for ALS and Huntington's data, the manuscript does not report model training details (feature engineering, hyperparameter selection, cross-validation strategy), predictive performance metrics on held-out historical data, or sensitivity analyses for covariate or outcome shifts between historical and trial cohorts. These omissions limit assessment of whether the reported treatment-effect estimates remain reliable when the digital-twin predictions are transported to the current trial population.

    Authors: We acknowledge these omissions limit reproducibility and transportability assessment. The revised manuscript will include a new subsection reporting feature engineering choices, hyperparameter tuning via cross-validation, predictive performance metrics (e.g., RMSE on held-out historical data), and sensitivity analyses that examine the impact of covariate and outcome distribution shifts between the historical training cohorts and the trial populations. revision: yes
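    A sketch of what such checks could look like in code, using hypothetical stand-ins (the model, cohorts, and thresholds below are illustrative, not the authors'):

```python
# Transportability checks for digital-twin predictions (illustrative).
# `model`, `X_hist_test`, `y_hist_test`, `X_hist`, `X_trial` are hypothetical
# stand-ins for a fitted outcome model and the historical/trial cohorts.
import numpy as np

def holdout_rmse(model, X_hist_test, y_hist_test):
    """Predictive accuracy on held-out historical patients."""
    resid = y_hist_test - model.predict(X_hist_test)
    return float(np.sqrt(np.mean(resid ** 2)))

def standardized_mean_differences(X_hist, X_trial):
    """Per-covariate shift between historical and trial cohorts; values
    above roughly 0.1-0.25 are commonly read as meaningful imbalance."""
    mu_h, mu_t = X_hist.mean(axis=0), X_trial.mean(axis=0)
    pooled_sd = np.sqrt((X_hist.var(axis=0) + X_trial.var(axis=0)) / 2)
    return np.abs(mu_h - mu_t) / pooled_sd
```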

  3. Referee: [Power and sample size formulas] The power and sample size formulas are presented without accompanying derivation, simulation studies, or empirical validation showing type-I error control and coverage under realistic ML model misspecification or distribution shift scenarios. This weakens the practical utility of the formulas for trial planning.

    Authors: We agree that supporting material is needed for practical use. The revision will add an appendix containing the full derivation of the power and sample-size formulas from the asymptotic variance of the doubly robust estimator, plus simulation studies that evaluate type-I error control and coverage under ML misspecification and realistic distribution-shift scenarios between historical and trial data. revision: yes
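    A textbook-style version of what such an appendix could contain (assumed forms, not the paper's): with V the asymptotic variance of the doubly robust estimator, the usual two-sided sample-size formula is

```latex
% Treated-arm sample size for power 1-\beta at two-sided level \alpha
% against effect \delta, given asymptotic estimator variance V.
n_1 = \frac{\left( z_{1-\alpha/2} + z_{1-\beta} \right)^2 V}{\delta^2}
```

    and a minimal harness for the promised type-I error check, pairing outcome-model misspecification with covariate shift (everything here is illustrative):

```python
# Sketch: type-I error of a naive twin-based z-test when a linear twin model
# misses a quadratic term AND the trial cohort is covariate-shifted.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_sims, n_hist, n_trial = 500, 400, 80
rejections = 0

for _ in range(n_sims):
    # Historical controls with a nonlinearity the linear twin cannot capture.
    X_h = rng.normal(size=(n_hist, 2))
    y_h = X_h[:, 0] + 0.5 * X_h[:, 1] ** 2 + rng.normal(size=n_hist)
    twin = LinearRegression().fit(X_h, y_h)

    # Trial under the null (zero treatment effect), but with shifted covariates.
    X_t = rng.normal(loc=0.3, size=(n_trial, 2))
    y_t = X_t[:, 0] + 0.5 * X_t[:, 1] ** 2 + rng.normal(size=n_trial)

    diffs = y_t - twin.predict(X_t)
    z = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_trial))
    rejections += abs(z) > 1.96

print(f"empirical type-I error: {rejections / n_sims:.3f} (nominal 0.05)")
```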

Circularity Check

0 steps flagged

No significant circularity; claims rest on external estimators and independent reanalyses

full rationale

The paper reviews established doubly robust estimators, derives power formulas from standard statistical principles, and demonstrates methods via reanalysis of external ALS and Huntington's datasets. No equations or central claims reduce by construction to fitted parameters renamed as predictions, nor do they depend on self-citation chains or author-specific uniqueness theorems. The core argument for digital twins as synthetic controls is supported by references to prior literature on doubly robust methods without self-referential loops, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that historical data distributions are close enough to current trial populations for ML predictions to serve as valid controls; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Historical datasets can train models that generalize to predict outcomes in new single-arm trial populations.
    Required for digital twins to function as unbiased synthetic controls; stated implicitly in the focus on training on historical data.

pith-pipeline@v0.9.0 · 5557 in / 1227 out tokens · 77177 ms · 2026-05-14T19:16:27.084601+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

28 extracted references · 1 canonical work page

  1. [1] Miguel A. Hernán and James M. Robins. Causal Inference: What If. Chapman & Hall/CRC, Boca Raton, FL, 2020.

  2. [2] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, Cambridge, 2015.

  3. [3] Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.

  4. [4] Victor Chernozhukov, Carlos Cinelli, Whitney Newey, Amit Sharma, and Vasilis Syrgkanis. Long story short: Omitted variable bias in causal machine learning, 2024.

  5. [5] Fabrizio Benedetti. Placebo effects: from the neurobiological paradigm to translational implications. Neuron, 84(3):623–637, November 2014.

  6. [6] Guoqiao Wang, Scott Berry, Chengjie Xiong, Jason Hassenstab, Melanie Quintana, Eric M. McDade, Paul Delmar, Matteo Vestrucci, Gopalan Sethuraman, Randall J. Bateman, and Dominantly Inherited Alzheimer Network Trials Unit. A novel cognitive disease progression model for clinical trials in autosomal-dominant Alzheimer's disease. Stat. Med., 37(21):3047–3055, …

  7. [7] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

  8. [8] Alberto Abadie and Guido W. Imbens. Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267, 2006.

  9. [9] Elizabeth A. Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1–21, 2010.

  10. [10] Keisuke Hirano, Guido W. Imbens, and Geert Ridder. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189, 2003.

  11. [11] Anastasios A. Tsiatis. Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York, 2006.

  12. [12] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

  13. [13] Edward H. Kennedy. Semiparametric theory and empirical processes in causal inference. Statistical Science, 37(3):289–308, 2022.

  14. [14] Alejandro Schuler, David Walsh, Diana Hall, Jon Walsh, and Charles Fisher. Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score. The International Journal of Biostatistics, 18(2):329–356, 2022.

  15. [15] Mark J. van der Laan and Daniel Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1):1–40, 2006.

  16. [16] Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, 2011.

  17. [17] U.S. Food and Drug Administration. Considerations for the use of artificial intelligence to support regulatory decision-making for drug and biological products. Draft guidance for industry and other interested parties, January 2025.

  18. [18] U.S. Food and Drug Administration. Considerations for the design and conduct of externally controlled trials for drug and biological products. Draft guidance for industry, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER), and Oncology Center of Excellence (OCE), Silver Spring, MD, February 2023. Docket No. …

  19. [19] U.S. Food and Drug Administration. Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. Guidance for industry, Center for Drug Evaluation and Research (CDER) and Center for Biologics Evaluation and Research (CBER), Silver Spring, MD, July 2024.

  20. [20] U.S. Food and Drug Administration. Real-world data: Assessing registries to support regulatory decision-making for drug and biological products. Guidance for industry, Center for Drug Evaluation and Research (CDER) and Center for Biologics Evaluation and Research (CBER), Silver Spring, MD, December 2023. Docket No. FDA-2021-D-1146.

  21. [21] Merit E. Cudkowicz, Jeremy M. Shefner, David A. Schoenfeld, Robert H. Brown, Heather Johnson, Mohsin Qureshi, Alan Pestronk, James Caress, Peter Donofrio, Erik Sorenson, Walter G. Bradley, William E. Antholine, Sherry Shrader, Tom Ferguson, and ALS CNTF Treatment Study Group. Trial of celecoxib in amyotrophic lateral sclerosis. Annals of Neurology, 60(1), …

  22. [22] Aileen McGarry, Michael P. McDermott, Karl Kieburtz, Elizabeth A. de Blieck, M. Flint Beal, Rong Chen, Jody Corey-Bloom, Andrew Feigin, Tamara Pringsheim, Ira Shoulson, John Tetrud, Richard L. Watts, Hui Zhao, and Huntington Study Group. A randomized, double-blind, placebo-controlled trial of coenzyme Q10 in Huntington disease. Neurology, 88(2):152–159, 2017.

  23. [23] Peter C. Austin. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics, 10(2):150–161, 2011.

  24. [24] Dimitris N. Politis and Joseph P. Romano. Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22(4):2031–2050, 1994.

  25. [25] Alberto Abadie and Guido W. Imbens. On the failure of the bootstrap for matching estimators. Econometrica, 76(6):1537–1557, 2008.

  26. [26] Edward H. Kennedy. Semiparametric doubly robust targeted double machine learning: A review. In Handbook of Statistical Methods for Precision Medicine, pages 207–236. Chapman and Hall/CRC, 2024.

  27. [27] Oliver Hines, Oliver Dukes, Karla Diaz-Ordaz, and Stijn Vansteelandt. Demystifying statistical learning based on efficient influence functions. The American Statistician, 76(3):292–304, 2022.

  28. [28] Nameyeh Alam, Jake Basilico, Daniele Bertolini, Satish Casie Chetty, Heather D'Angelo, Ryan Douglas, Charles K. Fisher, Franklin Fuller, Melissa Gomes, Rishabh Gupta, Alex Lang, Anton Loukianov, Rachel Mak-McCully, Cary Murray, Hanalei Pham, Susanna Qiao, Elena Ryapolova-Webb, Aaron Smith, Dimitri Theoharatos, Anil Tolwani, Eric W. Tramel, Anna Vidovszky …
