pith. machine review for the scientific record.

arxiv: 2605.01050 · v2 · submitted 2026-05-01 · 📊 stat.AP

Recognition: 2 theorem links

· Lean Theorem

Trust Me, I'm a Doctor?

Mats Stensrud, Zach Shahn

Pith reviewed 2026-05-12 01:56 UTC · model grok-4.3

classification 📊 stat.AP
keywords: causal inference · treatment effect heterogeneity · physician discretion · nested randomized trial · observational data · sharp bounds · gain score · evidence-based medicine

The pith

Combined randomized and observational data yield sharp bounds on how many physicians outperform the trial's best fixed treatment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether doctors' individual treatment choices can beat the single strategy that performed best on average in a randomized trial. It uses data from a trial nested inside a larger observational cohort drawn from the same population to compare each physician's observed outcomes against the trial winner. A gain score is defined to measure this difference for each doctor. Sharp bounds are then derived on the proportion of physicians whose gain scores are nonnegative. This matters because it shows what the data can and cannot tell us about when physician discretion improves on rigid adherence to trial averages.

Core claim

We define a gain score that formalizes the comparison between a physician's personal treatment strategy and the strategy of always choosing the treatment that performed better on average in the randomized trial. Using outcomes observed under treatment, control, and usual care in a nested design, we derive sharp bounds on the proportion of physicians whose personal strategies perform at least as well as, or better than, the trial's better-performing treatment.

What carries the argument

The gain score, which compares each physician's observed outcomes to those expected from always selecting the trial's better treatment, together with the sharp bounds on the fraction of physicians for whom this score is nonnegative.

If this is right

  • The data can place both lower and upper limits on the share of physicians who match or exceed the trial recommendation.
  • When the lower bound is close to zero, the observed data supply little support for preferring physician discretion over the trial result.
  • When the lower bound is high, the data imply that a substantial share of physicians match or exceed the trial's better-performing treatment.
  • The bounds are sharp, meaning they cannot be tightened further without additional assumptions or data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bounding approach could apply to other decision-makers whose choices are observed alongside trial data, such as teachers or managers.
  • Simulations with known physician strategies could check how often the bounds correctly contain the true proportion in finite samples.
  • Extending the gain score to account for patient covariates might narrow the bounds when treatment effects vary systematically.
  • The framework highlights a general tension between average-effect evidence and individualized practice that appears in many fields beyond medicine.
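
The simulation idea in the list above can be made concrete with a toy sketch. Everything here is an assumption for illustration, not the paper's definition: the gain score is operationalized as a physician's long-run success rate under usual care minus the success rate of the better trial arm, and the names `gain_scores`, `share_nonnegative`, and `simulate_success_rates` are hypothetical. The sketch computes only the true proportion that the bounds are meant to contain; the bounds themselves would have to come from the paper's formulas, which are not reproduced here.

```python
import random


def gain_scores(physician_success_rates, p_treat, p_control):
    """Assumed gain score: physician success rate minus the better trial arm's rate."""
    best_arm = max(p_treat, p_control)
    return [rate - best_arm for rate in physician_success_rates]


def share_nonnegative(gains):
    """Proportion of physicians whose (assumed) gain score is >= 0."""
    return sum(g >= 0 for g in gains) / len(gains)


def simulate_success_rates(n_physicians, mean_rate, sd, seed=0):
    """Draw hypothetical physician success rates, clipped to [0, 1]."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, rng.gauss(mean_rate, sd))) for _ in range(n_physicians)]


if __name__ == "__main__":
    # Hypothetical trial arm success rates and physician skill distribution.
    rates = simulate_success_rates(n_physicians=500, mean_rate=0.68, sd=0.05, seed=1)
    gains = gain_scores(rates, p_treat=0.70, p_control=0.60)
    print("true share with nonnegative gain:", share_nonnegative(gains))
```

In a full check, one would repeat this over many simulated datasets, compute the paper's bounds from the simulated treatment, control, and usual-care outcomes, and record how often the interval contains the true share.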

Load-bearing premise

A randomized trial is nested inside an observational cohort from the same target population so that outcomes can be seen under treatment, control, and usual care.

What would settle it

Collect a new dataset in which each physician's actual long-run success rate is measured directly and compare that empirical proportion to the numerical bounds produced by the method; a value outside the interval would contradict the validity of the bounds or one of the stated assumptions.

Original abstract

Clinical trials usually target average treatment effects, but treatment decisions are made for individuals. This tension motivates a common criticism of evidence-based medicine: a treatment that is beneficial on average may be inappropriate for a particular patient, and skilled physicians may outperform rigid adherence to the strategy that performed best in a randomized trial. We consider how randomized and observational data from the same target population can be used to assess that possibility. Specifically, we study settings in which a randomized trial is nested within an observational cohort, so that outcomes are observed under treatment, control, and usual care. We ask what the observed data can reveal about how often physicians outperform the strategy suggested by the trial. We define a gain score to formalize this comparison and derive sharp bounds on the proportion of physicians whose personal strategies perform at least as well as, or better than, always choosing the better performing treatment from the trial. These results shed light on when clinical data support relying on physician discretion over the trial-average recommendation and when stronger justification is required.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper addresses the tension between average treatment effects from clinical trials and individual physician decisions. In settings where a randomized trial is nested within an observational cohort, allowing observation of outcomes under treatment, control, and usual care, the authors define a gain score to compare physicians' strategies to the better-performing trial treatment. They derive sharp bounds on the proportion of physicians whose personal strategies perform at least as well as or better than always choosing the better trial arm.

Significance. If the derived bounds are sharp and the identification is correct, this provides a valuable partial identification framework for assessing when clinical data support physician discretion over trial recommendations. The nested design is cleverly used to obtain the three necessary marginal outcome distributions. This could have implications for evidence-based medicine debates. The approach avoids parametric assumptions by focusing on sharp bounds.

major comments (1)
  1. Abstract: The claim that sharp bounds are derived from the observed data structure is central, but the description leaves the exact identification assumptions and proof strategy implicit; without explicit verification that the bounds are sharp given only the three marginal distributions (and no additional restrictions), it is difficult to assess whether the result holds under the stated nested design alone.
minor comments (2)
  1. The gain score definition would benefit from an intuitive example or numerical illustration early in the text to clarify how it formalizes the comparison between physician strategies and the trial's better arm.
  2. Consider adding a brief sensitivity discussion on how the bounds change if the nested design assumption (randomized trial within the same target population) is mildly violated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. The single major comment is addressed below.

Point-by-point responses
  1. Referee: Abstract: The claim that sharp bounds are derived from the observed data structure is central, but the description leaves the exact identification assumptions and proof strategy implicit; without explicit verification that the bounds are sharp given only the three marginal distributions (and no additional restrictions), it is difficult to assess whether the result holds under the stated nested design alone.

    Authors: We agree that the abstract would benefit from greater explicitness on this point. The nested design directly supplies the three marginal outcome distributions (under treatment, under control, and under usual care). The sharp bounds on the proportion of physicians whose gain score is nonnegative are obtained by optimizing over all joint distributions consistent with these marginals. Sharpness follows from the existence of extremal joint distributions that attain the bound values while respecting the observed marginals and the nested sampling structure, without parametric restrictions or further assumptions. The full identification argument and proof of sharpness appear in the main text and appendix. We will revise the abstract to state explicitly that the bounds are sharp given only the three observed marginal distributions from the nested design.

    revision: yes
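
The logic of "optimize over all joint distributions consistent with the marginals" can be illustrated with the classical Fréchet–Hoeffding bounds for two binary events. This is a generic sketch of the sharpness argument, not the paper's bounds: it assumes only two marginals rather than three, and the function names are hypothetical. The bounds are sharp because a valid joint distribution attains each endpoint, while any value outside the interval forces a negative cell probability.

```python
def frechet_bounds(p_a, p_b):
    """Sharp bounds on P(A and B) given only the marginals P(A) and P(B)."""
    if not (0.0 <= p_a <= 1.0 and 0.0 <= p_b <= 1.0):
        raise ValueError("marginal probabilities must lie in [0, 1]")
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper


def joint_table(p_a, p_b, p_ab):
    """The unique 2x2 joint with marginals p_a, p_b and P(A and B) = p_ab."""
    return {
        (1, 1): p_ab,
        (1, 0): p_a - p_ab,
        (0, 1): p_b - p_ab,
        (0, 0): 1.0 - p_a - p_b + p_ab,
    }


def is_valid_joint(table, tol=1e-12):
    """A joint is valid if all cells are nonnegative and they sum to one."""
    cells_ok = all(v >= -tol for v in table.values())
    total_ok = abs(sum(table.values()) - 1.0) <= tol
    return cells_ok and total_ok


if __name__ == "__main__":
    lo, hi = frechet_bounds(0.7, 0.6)
    # Extremal joints attain both endpoints, so the interval cannot be tightened.
    print(lo, hi)
    print(is_valid_joint(joint_table(0.7, 0.6, lo)))
    print(is_valid_joint(joint_table(0.7, 0.6, hi)))
```

The same pattern (exhibit a joint that attains each endpoint, and show that no compatible joint goes beyond it) is what a sharpness proof for the three-marginal case would need to establish.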

Circularity Check

0 steps flagged

No significant circularity; bounds derived from nested data structure

Full rationale

The paper defines a gain score formalizing physician performance relative to the trial's better arm and derives sharp bounds on the proportion of physicians meeting or exceeding it. This uses the explicit nested design (randomized trial inside observational cohort) to obtain the three marginal outcome distributions under treatment, control, and usual care. The partial-identification argument relies directly on these observed distributions as inputs; no step reduces the target quantity to a fitted parameter, self-referential equation, or self-citation chain. The abstract and skeptic analysis confirm the bounds are presented as sharp given that data structure, with no internal redefinition or smuggling of assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the nested data structure and the newly defined gain score; no numerical free parameters are mentioned in the abstract.

axioms (1)
  • domain assumption: The randomized trial is nested within an observational cohort from the same target population, with outcomes observed under treatment, control, and usual care.
    This is the core data-generating setting stated in the abstract.
invented entities (1)
  • gain score (no independent evidence)
    purpose: To formalize the comparison between a physician's personal strategy and the trial's best average treatment.
    Newly introduced quantity used to define the target proportion.

pith-pipeline@v0.9.0 · 5461 in / 1307 out tokens · 54111 ms · 2026-05-12T01:56:40.312901+00:00 · methodology

