arxiv: 2605.01050 · v2 · submitted 2026-05-01 · 📊 stat.AP

Trust Me, I'm a Doctor?

Mats Stensrud, Zach Shahn

Pith reviewed 2026-05-12 01:56 UTC · model grok-4.3

classification 📊 stat.AP

keywords causal inference · treatment effect heterogeneity · physician discretion · nested randomized trial · observational data · sharp bounds · gain score · evidence-based medicine

The pith

Combined randomized and observational data yield sharp bounds on how many physicians outperform the trial's best fixed treatment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether doctors' individual treatment choices can beat the single strategy that performed best on average in a randomized trial. It uses data from a trial nested inside a larger observational cohort drawn from the same population to compare each physician's observed outcomes against the trial winner. A gain score is defined to measure this difference for each doctor. Sharp bounds are then derived on the proportion of physicians whose gain scores are nonnegative. This matters because it shows what the data can and cannot tell us about when physician discretion improves on rigid adherence to trial averages.

Core claim

We define a gain score that formalizes the comparison between a physician's personal treatment strategy and the strategy of always choosing the treatment that performed better on average in the randomized trial. Using outcomes observed under treatment, control, and usual care in a nested design, we derive sharp bounds on the proportion of physicians whose personal strategies perform at least as well as, or better than, the trial's better-performing treatment.

What carries the argument

The gain score, which compares each physician's observed outcomes to those expected from always selecting the trial's better treatment, together with the sharp bounds on the fraction of physicians for whom this score is nonnegative.
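
A small numerical illustration may help fix ideas. It is a sketch under stated assumptions, not the paper's definition: it reads the gain score as a difference in mean outcomes, and it pretends each physician's mean outcome under usual care is known and comparable across physicians (ignoring patient mix). The fact that the paper derives bounds suggests the real target is not directly computable this way. All numbers and physician labels are hypothetical.

```python
# Hypothetical mean outcomes (higher is better); none of these come from the paper.
trial_mean = {"treatment": 0.60, "control": 0.50}
usual_care_mean = {"dr_a": 0.66, "dr_b": 0.60, "dr_c": 0.52, "dr_d": 0.58}

# The benchmark is the arm that performed better on average in the trial.
best_arm = max(trial_mean, key=trial_mean.get)
benchmark = trial_mean[best_arm]

# Assumed reading of the gain score: physician mean minus benchmark mean.
gain = {doc: round(m - benchmark, 10) for doc, m in usual_care_mean.items()}

# Share of physicians who match or beat the trial's better arm.
share_nonnegative = sum(g >= 0 for g in gain.values()) / len(gain)

print(best_arm, gain, share_nonnegative)
```

In this toy setting two of the four physicians have a nonnegative gain, one of them only by tying the benchmark, which is why the "at least as well as" wording matters.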

If this is right

The data can place both lower and upper limits on the share of physicians who match or exceed the trial recommendation.
When the lower bound is close to zero, the observed data supply little support for preferring physician discretion over the trial result.
When the lower bound is high, the data imply that a substantial share of physicians match or beat the trial's better-performing treatment, whatever the unobserved dependence between outcomes.
The bounds are sharp, meaning they cannot be tightened further without additional assumptions or data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bounding approach could apply to other decision-makers whose choices are observed alongside trial data, such as teachers or managers.
Simulations with known physician strategies could check how often the bounds correctly contain the true proportion in finite samples.
Extending the gain score to account for patient covariates might narrow the bounds when treatment effects vary systematically.
The framework highlights a general tension between average-effect evidence and individualized practice that appears in many fields beyond medicine.
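
The simulation idea above needs a ground truth to check coverage against. The following is a minimal sketch of that one ingredient, assuming a hypothetical data-generating process: a single patient covariate x, made-up success probabilities for each arm, and physicians who pick the better arm for the patient with probability `skill` and otherwise guess. None of this is taken from the paper, and the paper's bounds themselves are not implemented here.

```python
def p_success(arm: int, x: float) -> float:
    """Hypothetical success probability for a patient with covariate x in [0, 1]."""
    return 0.3 + 0.5 * x if arm == 1 else 0.5

# Midpoint grid standing in for patients with x ~ Uniform(0, 1).
X_GRID = [(i + 0.5) / 1000 for i in range(1000)]

def mean_over_patients(value_at_x) -> float:
    """Average a per-patient expected outcome over the covariate grid."""
    return sum(value_at_x(x) for x in X_GRID) / len(X_GRID)

def fixed_arm_value(arm: int) -> float:
    """Expected outcome if every patient receives the same arm."""
    return mean_over_patients(lambda x: p_success(arm, x))

def physician_value(skill: float) -> float:
    """Expected outcome under a physician's own (hypothetical) strategy."""
    def value_at_x(x: float) -> float:
        best = max(p_success(1, x), p_success(0, x))      # picks the better arm
        guess = 0.5 * (p_success(1, x) + p_success(0, x))  # picks an arm at random
        return skill * best + (1 - skill) * guess
    return mean_over_patients(value_at_x)

best_fixed = max(fixed_arm_value(0), fixed_arm_value(1))
skills = [k / 20 for k in range(21)]            # 21 physicians, skill 0.00 to 1.00
gains = [physician_value(s) - best_fixed for s in skills]
true_share = sum(g >= 0 for g in gains) / len(gains)

print(round(best_fixed, 3), round(true_share, 3))
```

A coverage study would then draw finite trial and cohort samples from the same process, compute the estimated bounds, and record how often they contain `true_share`.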

Load-bearing premise

A randomized trial is nested inside an observational cohort from the same target population so that outcomes can be seen under treatment, control, and usual care.

What would settle it

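Collect a new dataset in which each physician's long-run performance under their own strategy is measured directly, compute the share of physicians who match or beat the trial's better arm, and compare that share to the numerical bounds produced by the method. A share outside the interval would contradict the validity of the bounds under the stated assumptions; sharpness is a separate property, established by exhibiting data-compatible distributions that attain each endpoint.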

Original abstract

Clinical trials usually target average treatment effects, but treatment decisions are made for individuals. This tension motivates a common criticism of evidence-based medicine: a treatment that is beneficial on average may be inappropriate for a particular patient, and skilled physicians may outperform rigid adherence to the strategy that performed best in a randomized trial. We consider how randomized and observational data from the same target population can be used to assess that possibility. Specifically, we study settings in which a randomized trial is nested within an observational cohort, so that outcomes are observed under treatment, control, and usual care. We ask what the observed data can reveal about how often physicians outperform the strategy suggested by the trial. We define a gain score to formalize this comparison and derive sharp bounds on the proportion of physicians whose personal strategies perform at least as well as, or better than, always choosing the better performing treatment from the trial. These results shed light on when clinical data support relying on physician discretion over the trial-average recommendation and when stronger justification is required.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a gain score and derives sharp bounds on the share of physicians outperforming the trial's best arm using the three marginal distributions from a nested RCT inside an observational cohort.

The punchline is that this work turns the average-versus-individual tension into a partial identification problem by defining a gain score for physician strategies relative to the trial winner and bounding the fraction of doctors who match or beat it. The nested data structure supplies the needed treatment, control, and usual-care outcome distributions, which lets them avoid modeling physician decisions directly and still get sharp bounds under the stated assumptions.

Referee Report

1 major / 2 minor

Summary. The paper addresses the tension between average treatment effects from clinical trials and individual physician decisions. In settings where a randomized trial is nested within an observational cohort, allowing observation of outcomes under treatment, control, and usual care, the authors define a gain score to compare physicians' strategies to the better-performing trial treatment. They derive sharp bounds on the proportion of physicians whose personal strategies perform at least as well as or better than always choosing the better trial arm.

Significance. If the derived bounds are sharp and the identification is correct, this provides a valuable partial identification framework for assessing when clinical data support physician discretion over trial recommendations. The nested design is cleverly used to obtain the three necessary marginal outcome distributions. This could have implications for evidence-based medicine debates. The approach avoids parametric assumptions by focusing on sharp bounds.

major comments (1)

Abstract: The claim that sharp bounds are derived from the observed data structure is central, but the description leaves the exact identification assumptions and proof strategy implicit; without explicit verification that the bounds are sharp given only the three marginal distributions (and no additional restrictions), it is difficult to assess whether the result holds under the stated nested design alone.

minor comments (2)

The gain score definition would benefit from an intuitive example or numerical illustration early in the text to clarify how it formalizes the comparison between physician strategies and the trial's better arm.
Consider adding a brief sensitivity discussion on how the bounds change if the nested design assumption (randomized trial within the same target population) is mildly violated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. The single major comment is addressed below.

Referee: Abstract: The claim that sharp bounds are derived from the observed data structure is central, but the description leaves the exact identification assumptions and proof strategy implicit; without explicit verification that the bounds are sharp given only the three marginal distributions (and no additional restrictions), it is difficult to assess whether the result holds under the stated nested design alone.

Authors: We agree that the abstract would benefit from greater explicitness on this point. The nested design directly supplies the three marginal outcome distributions (under treatment, under control, and under usual care). The sharp bounds on the proportion of physicians whose gain score is nonnegative, that is, whose strategies perform at least as well as the better trial arm, are obtained by optimizing over all joint distributions consistent with these marginals. Sharpness follows from the existence of extremal joint distributions that attain the bound values while respecting the observed marginals and the nested sampling structure, without parametric restrictions or further assumptions. The full identification argument and proof of sharpness appear in the main text and appendix. We will revise the abstract to state explicitly that the bounds are sharp given only the three observed marginal distributions from the nested design.

Revision: yes
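
The phrase "optimizing over all joint distributions consistent with these marginals" can be made concrete with the classical Fréchet bounds. The sketch below is a generic two-marginal, binary-outcome example and not the paper's gain-score bound: it bounds P(Y_a >= Y_b) when only p_a = P(Y_a = 1) and p_b = P(Y_b = 1) are known, and it builds the extremal joint distributions that show the bounds are attained. The function names are hypothetical.

```python
def frechet_bounds_at_least(p_a: float, p_b: float) -> tuple[float, float]:
    """Sharp bounds on P(Y_a >= Y_b) for binary outcomes given only the marginals."""
    # Y_a < Y_b happens only in the cell (Y_a = 0, Y_b = 1).
    worse_min = max(0.0, p_b - p_a)     # Frechet lower bound on that cell
    worse_max = min(1.0 - p_a, p_b)     # Frechet upper bound on that cell
    return 1.0 - worse_max, 1.0 - worse_min

def joint_with_worse_mass(p_a: float, p_b: float, w: float) -> dict:
    """2x2 joint over (Y_a, Y_b) with P(Y_a = 0, Y_b = 1) = w and the given marginals."""
    return {
        (0, 1): w,
        (1, 1): p_b - w,
        (0, 0): (1.0 - p_a) - w,
        (1, 0): p_a - (p_b - w),
    }

# Hypothetical marginals: success rates of 0.60 and 0.50.
lower, upper = frechet_bounds_at_least(0.60, 0.50)

# Extremal joints: one attains the lower bound, the other the upper bound.
joint_at_lower = joint_with_worse_mass(0.60, 0.50, 1.0 - lower)
joint_at_upper = joint_with_worse_mass(0.60, 0.50, 1.0 - upper)

print(lower, upper)
```

Both extremal tables have nonnegative cells and reproduce the two marginals, so neither endpoint can be tightened without further assumptions. That is the sense of "sharp" used throughout this page.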

Circularity Check

0 steps flagged

No significant circularity; bounds derived from nested data structure

The paper defines a gain score formalizing physician performance relative to the trial's better arm and derives sharp bounds on the proportion of physicians meeting or exceeding it. This uses the explicit nested design (randomized trial inside observational cohort) to obtain the three marginal outcome distributions under treatment, control, and usual care. The partial-identification argument relies directly on these observed distributions as inputs; no step reduces the target quantity to a fitted parameter, self-referential equation, or self-citation chain. The abstract and skeptic analysis confirm the bounds are presented as sharp given that data structure, with no internal redefinition or smuggling of assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the nested data structure and the newly defined gain score; no numerical free parameters are mentioned in the abstract.

axioms (1)

domain assumption: The randomized trial is nested within an observational cohort from the same target population, with outcomes observed under treatment, control, and usual care.
This is the core data-generating setting stated in the abstract.

invented entities (1)

gain score · no independent evidence
purpose: To formalize the comparison between a physician's personal strategy and the trial's best average treatment.
Newly introduced quantity used to define the target proportion.

pith-pipeline@v0.9.0 · 5461 in / 1307 out tokens · 54111 ms · 2026-05-12T01:56:40.312901+00:00 · methodology
