
arxiv: 2606.21050 · v1 · pith:DZC5JRNLnew · submitted 2026-06-19 · 📊 stat.AP · stat.ME

Triage Score: A Counterfactual Risk Assessment Instrument

Pith reviewed 2026-06-26 13:06 UTC · model grok-4.3

classification 📊 stat.AP stat.ME
keywords: triage score · counterfactual utility · risk assessment · pretrial decisions · randomized controlled trial · policy evaluation · decision making

The pith

Triage scores based on additive counterfactual utilities include risk scores as a special case and account for outcomes under intervention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional risk scores predict the chance of an undesired outcome only if no intervention occurs. The paper introduces triage scores that instead add up the utilities of potential outcomes under both intervention and no intervention. This structure lets decision makers encode ethical or practical preferences through the utilities assigned to each counterfactual scenario. The authors apply the approach to data from their own randomized trial of a pretrial risk instrument and show that the resulting scores produce different policy evaluations and learning outcomes. A sympathetic reader cares because high-stakes decisions in medicine and justice routinely require weighing what would happen under alternative actions.

Core claim

Triage scores are based on additive counterfactual utilities and include risk scores as a special case. Unlike risk scores, triage scores can incorporate counterfactual outcomes under alternative decisions, enabling decision makers to incorporate a wide range of ethical and practical factors. We illustrate the use of triage scores with an application to our own randomized controlled trial evaluating a pretrial risk score. Our analysis demonstrates that triage scores are able to capture rich utility structures and yield substantively distinct results regarding policy evaluation and learning.

What carries the argument

Additive counterfactual utilities, which assign values to outcomes that would occur under each possible decision and sum those values into a single score.
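As a rough formalization (an illustrative assumption, not notation taken from the paper), consider a binary decision $d \in \{0, 1\}$, with $d = 1$ denoting intervention, and potential outcomes $Y(0)$ and $Y(1)$. The weight functions $w_{d,d'}$ and the decision cost $\kappa_d$ below are hypothetical names introduced only for this sketch.

```latex
U\bigl(d;\, Y(0), Y(1)\bigr) = \sum_{d' \in \{0,1\}} w_{d,d'}\bigl(Y(d')\bigr) - \kappa_d,
\qquad
S(x) = \mathbb{E}\bigl[\, U\bigl(1;\, Y(0), Y(1)\bigr) - U\bigl(0;\, Y(0), Y(1)\bigr) \mid X = x \,\bigr].
```

Because each term in this assumed form involves only one potential outcome, $S(x)$ depends only on the marginal distributions of $Y(0)$ and $Y(1)$ given $X = x$, not on their joint distribution.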

If this is right

  • Risk scores emerge as the special case in which the outcome under intervention carries zero weight in the utility (in the pretrial application, a zero cost of crime under cash bail).
  • Decision makers can encode a range of ethical and practical considerations by changing the utilities attached to each counterfactual outcome.
  • Policy evaluation and learning produce substantively different results once counterfactual utilities replace simple risk prediction.
  • The framework applies directly to any setting where randomized data on interventions are available, such as pretrial detention or medical treatment choices.
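The first and third points can be illustrated with a minimal sketch. The functional form is an assumption, not the paper's definition: it keeps only the cost terms named in the figure captions (cost of crime under release, cost of crime under cash bail, cost of cash bail) and omits the regret terms. All names and probabilities are hypothetical.

```python
def triage_score(p_crime_ror, p_crime_cash, c_cash_crime, c_d, c_ror_crime=1.0):
    """Expected utility of cash bail minus expected utility of release.

    Assumed additive form: each term uses one potential outcome only.
    p_crime_ror  -- P(new crime | released), the usual risk score
    p_crime_cash -- P(new crime | cash bail)
    """
    expected_utility_ror = -c_ror_crime * p_crime_ror
    expected_utility_cash = -c_cash_crime * p_crime_cash - c_d
    return expected_utility_cash - expected_utility_ror


# Made-up probabilities for two hypothetical defendants: (under release, under cash bail).
defendants = {"A": (0.60, 0.55), "B": (0.50, 0.10)}

# With zero weight on the outcome under cash bail, the score is p_crime_ror - c_d,
# so it orders defendants exactly as the risk score does.
risk_like = {
    name: triage_score(p_ror, p_cash, c_cash_crime=0.0, c_d=0.2)
    for name, (p_ror, p_cash) in defendants.items()
}

# With positive weight on the outcome under cash bail, the ordering can change.
full = {
    name: triage_score(p_ror, p_cash, c_cash_crime=1.0, c_d=0.2)
    for name, (p_ror, p_cash) in defendants.items()
}
```

In this toy example defendant A ranks above B under the risk-like setting, while B ranks above A once the outcome under cash bail is weighted, because cash bail barely changes A's probability of a new crime.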

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Existing risk-score systems could be converted to triage scores if utilities for the intervention arm can be specified or elicited.
  • The additive structure may extend to multi-treatment settings where more than two decisions are possible.
  • If utilities prove difficult to agree upon, sensitivity checks over plausible utility ranges would become a necessary practical step.
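A sensitivity check of the kind mentioned in the last point could be sketched as a grid over utility parameters. This is a hypothetical illustration: the score formula is the same assumed simplification without regret terms, the grid values and probabilities are made up, and the function names are not from the paper.

```python
from itertools import product


def recommend_cash_bail(p_crime_ror, p_crime_cash, c_cash_crime, c_d, c_ror_crime=1.0):
    """Recommend cash bail when the assumed triage score is positive."""
    score = c_ror_crime * p_crime_ror - c_cash_crime * p_crime_cash - c_d
    return score > 0


def sensitivity_grid(p_crime_ror, p_crime_cash, crime_costs, bail_costs):
    """Map each (c_cash_crime, c_d) pair to the resulting recommendation."""
    return {
        (c_cash_crime, c_d): recommend_cash_bail(p_crime_ror, p_crime_cash, c_cash_crime, c_d)
        for c_cash_crime, c_d in product(crime_costs, bail_costs)
    }


# Made-up inputs for one hypothetical case.
grid = sensitivity_grid(
    p_crime_ror=0.4,
    p_crime_cash=0.2,
    crime_costs=[0.0, 0.5, 1.0],
    bail_costs=[0.05, 0.25, 0.45],
)

# The recommendation is robust only if it is the same across the whole grid.
is_robust = len(set(grid.values())) == 1
```

Here the recommendation flips within the grid, which is the situation in which disagreement about utilities would matter for the decision.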

Load-bearing premise

Counterfactual outcomes under intervention can be defined, assigned utilities, and estimated from the RCT data in a way that supports the claimed policy distinctions.

What would settle it

Re-estimating the triage scores on the pretrial RCT data and finding that the resulting policy recommendations or learned policies are identical to those produced by the original risk score.
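The comparison described above can be sketched as an agreement rate between two policies. The cohort below is made up and the score formulas are assumed simplifications (no regret terms), so this shows only the shape of the check, not the paper's analysis.

```python
def agreement_rate(policy_a, policy_b):
    """Fraction of cases on which two binary policies make the same decision."""
    if len(policy_a) != len(policy_b):
        raise ValueError("policies must cover the same cases")
    if not policy_a:
        raise ValueError("policies must be non-empty")
    matches = sum(a == b for a, b in zip(policy_a, policy_b))
    return matches / len(policy_a)


# Hypothetical (P(crime | release), P(crime | cash bail)) pairs.
cohort = [(0.60, 0.55), (0.50, 0.10), (0.20, 0.15), (0.70, 0.20)]
c_d = 0.3  # assumed cost of cash bail

# Risk-score policy: ignores the outcome under cash bail.
risk_policy = [p_ror - c_d > 0 for p_ror, _ in cohort]

# Triage policy: also weighs the outcome under cash bail (unit cost assumed).
triage_policy = [p_ror - p_cash - c_d > 0 for p_ror, p_cash in cohort]

rate = agreement_rate(risk_policy, triage_policy)
```

An agreement rate of 1.0 on the actual trial data, across the utility settings considered, would correspond to the outcome described above.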

Figures

Figures reproduced from arXiv: 2606.21050 by D. James Greiner, Kosuke Imai, Ryan Halen, Sooahn Shin.

Figure 1: An example of a causal diagram under which Assumptions 1 and 2 hold. When Z = 0, there is no edge R → D. In addition to the single-blinded treatment assignment, we assume the unconfoundedness of the decision, implying that the potential outcomes are independent of the decision, conditional on the observed covariates, treatment assignment, and the AI recommendation. Assumption 2 (Unconfoundedness of decisio…
Figure 2: Estimated utility of different decision-making regimes (NCA outcome). The optimal decision tree policy has maximum depth 2 and minimum leaf size 10. Each panel corresponds to a different regret parameter $r^{\mathrm{ror}}_{\neg\mathrm{crime}}$; x-axis = cost of crime under cash bail $c^{\mathrm{cash}}_{\mathrm{crime}}$; y-axis = cost of cash bail $c_d$; $c^{\mathrm{ror}}_{\mathrm{crime}} = 1$ and $r^{\mathrm{cash}}_{\neg\mathrm{crime}} = r^{\mathrm{ror}}_{\neg\mathrm{crime}} \, c^{\mathrm{cash}}_{\mathrm{crime}}$. Blue region at $c^{\mathrm{cash}}_{\mathrm{crime}} = 0$: risk score f…
Figure 3: Estimated preference for human decisions over human+PSA recommendations. Each column = regret under release; x-axis = cost of new crime under cash; y-axis = cost of cash bail; $c^{\mathrm{ror}}_{\mathrm{crime}} = 1$ and $r^{\mathrm{cash}}_{\neg\mathrm{crime}} = r^{\mathrm{ror}}_{\neg\mathrm{crime}} \, c^{\mathrm{cash}}_{\mathrm{crime}}$. Blue boxes: risk score system (when $c^{\mathrm{cash}}_{\mathrm{crime}} = 0$). First column: standard decision framework (when $r^{\mathrm{ror}}_{\neg\mathrm{crime}} = r^{\mathrm{cash}}_{\neg\mathrm{crime}} = 0$).
Figure 4: Estimated change in cash bail proportion under optimal decision tree. Each column = regret under release; x-axis = cost of new crime under cash; y-axis = cost of cash bail; $c^{\mathrm{ror}}_{\mathrm{crime}} = 1$ and $r^{\mathrm{cash}}_{\neg\mathrm{crime}} = r^{\mathrm{ror}}_{\neg\mathrm{crime}} \, c^{\mathrm{cash}}_{\mathrm{crime}}$. Red boxes: risk score system (when $c^{\mathrm{cash}}_{\mathrm{crime}} = 0$). First column: standard decision framework (when $r^{\mathrm{ror}}_{\neg\mathrm{crime}} = r^{\mathrm{cash}}_{\neg\mathrm{crime}} = 0$).

Original abstract

Risk assessment instruments, also known as "risk scores," are widely used in high-stakes decision-making settings such as medicine and the criminal justice system. A risk score predicts the likelihood of an undesired outcome if no intervention is made. Thus, a sufficiently high score is often interpreted as a recommendation to intervene. However, risk scores fail to account for what would happen if a decision-maker does intervene. This failure is problematic because effective decision making requires consideration of both or multiple potential outcomes. We propose "triage scores," which are based on additive counterfactual utilities and include risk scores as a special case. Unlike risk scores, triage scores can incorporate counterfactual outcomes under alternative decisions, enabling decision makers to incorporate a wide range of ethical and practical factors. We illustrate the use of triage scores with an application to our own randomized controlled trial evaluating a pretrial risk score. Our analysis demonstrates that triage scores are able to capture rich utility structures and yield substantively distinct results regarding policy evaluation and learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes 'triage scores' defined via additive counterfactual utilities as a generalization of conventional risk scores (which predict outcomes under no intervention). Triage scores are claimed to incorporate outcomes under alternative decisions, thereby allowing incorporation of ethical and practical factors. The proposal is illustrated via an application to data from the authors' own RCT that randomized provision of an existing pretrial risk score to judges; the analysis is said to show that triage scores capture rich utility structures and produce substantively distinct results for policy evaluation and learning.

Significance. If the counterfactual utilities can be identified and estimated from the RCT without strong untestable assumptions, the framework would provide a principled way to move beyond purely predictive risk instruments toward decision-theoretic scores that explicitly trade off multiple potential outcomes. The RCT-based illustration is a positive feature that grounds the conceptual proposal in real data.

major comments (1)
  1. [Abstract and RCT application section] The central empirical claim (abstract) that triage scores 'yield substantively distinct results regarding policy evaluation and learning' from the pretrial RCT rests on recovering additive counterfactual utilities for decisions under alternative policies. The RCT randomizes only the provision of the existing risk score; recovering utilities for counterfactual judge decisions (e.g., under a different threshold or utility function) therefore requires an explicit model of judge behavior, information sets, or compliance under unobserved policies. No such model or identifying assumptions are stated or tested, rendering the counterfactual quantities unidentified from the observed data alone.
minor comments (1)
  1. [Introduction/Definition] The precise functional form of the 'additive counterfactual utilities' and how they reduce to a standard risk score as a special case should be stated formally (e.g., via an equation) rather than only conceptually.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying a key issue in the empirical section. We address the major comment below.

point-by-point responses
  1. Referee: [Abstract and RCT application section] The central empirical claim (abstract) that triage scores 'yield substantively distinct results regarding policy evaluation and learning' from the pretrial RCT rests on recovering additive counterfactual utilities for decisions under alternative policies. The RCT randomizes only the provision of the existing risk score; recovering utilities for counterfactual judge decisions (e.g., under a different threshold or utility function) therefore requires an explicit model of judge behavior, information sets, or compliance under unobserved policies. No such model or identifying assumptions are stated or tested, rendering the counterfactual quantities unidentified from the observed data alone.

    Authors: The referee is correct. The RCT randomizes only the provision of the existing risk score, so extending the analysis to counterfactual policies (different thresholds or utility functions) requires an explicit model of judge behavior together with identifying assumptions that are not currently stated. The manuscript estimates additive counterfactual utilities from the observed RCT data under the randomized provision but does not supply the additional behavioral model needed for fully counterfactual policy evaluation. We will revise the paper to (i) state the required identifying assumptions explicitly, (ii) describe a simple model of judge compliance with the provided score, and (iii) include sensitivity checks. These changes will be made in the RCT application section, and the abstract will be updated to reflect the clarified scope.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: triage score introduced by definition as extension of risk scores

full rationale

The paper defines triage scores directly as additive counterfactual utilities, with risk scores stated as a special case. This is an explicit definitional proposal rather than a derivation, prediction, or fitted quantity that reduces to its own inputs. No self-citation load-bearing step, uniqueness theorem, ansatz smuggling, or renaming of known results is present in the abstract or described claims. The RCT application concerns estimation under additional modeling assumptions, but the core concept does not reduce by construction to fitted parameters or prior self-citations. The derivation chain is self-contained as a conceptual extension.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on the domain assumption that counterfactual outcomes and additive utilities can be meaningfully specified for decision problems.

axioms (1)
  • Domain assumption: counterfactual outcomes under intervention can be defined and combined additively via a utility function.
    This premise underpins the definition of triage scores and their claimed advantage over risk scores.

pith-pipeline@v0.9.1-grok · 5703 in / 986 out tokens · 44148 ms · 2026-06-26T13:06:07.587429+00:00 · methodology

