Counterfactually Safe Reinforcement Learning

Chengchun Shi; Jingyi Li; Peng Wu

arxiv: 2605.25114 · v1 · pith:ETLA5BZ5new · submitted 2026-05-24 · 📊 stat.ML · cs.LG

Jingyi Li, Peng Wu, Chengchun Shi

Pith reviewed 2026-06-29 23:46 UTC · model grok-4.3

keywords: reinforcement learning, counterfactual harm, individual safety, policy optimization, finite-sample analysis, sub-optimality bound, two-stage procedure

The pith

A two-stage procedure learns RL policies that maximize return while controlling counterfactual individual harm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines harm in reinforcement learning as the event where a chosen action produces a strictly worse outcome than a baseline alternative, assessed counterfactually for each individual. It introduces a two-stage procedure that first identifies potential harm and then optimizes policies to maximize expected return subject to this control. Finite-sample properties of the resulting policy are derived along with an explicit upper bound on the gap to the optimal policy. The procedure keeps the overall harm rate well-controlled. This matters because policies optimal on average can still produce worse outcomes for specific states or trajectories.

Core claim

The central claim is that a two-stage procedure allows learning policies that maximize expected return while the harm rate, defined as the probability that the chosen action is worse than a baseline counterfactual, remains well-controlled. Finite-sample properties are established and an upper bound on the sub-optimality gap is derived, with effectiveness shown on simulated and real-world datasets.
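To make the harm-rate definition concrete, here is a minimal simulation sketch. It is not the paper's estimator: it assumes a one-step toy environment in which both potential outcomes are generated for every individual (which real data never provides), a hypothetical outcome model with independent noise per action, and illustrative names such as `potential_outcomes` and `harm_rate`.

```python
import numpy as np

def potential_outcomes(x, rng):
    """Hypothetical toy model: outcomes under the baseline (a=0) and an alternative (a=1)."""
    y0 = x + rng.normal(size=x.shape[0])         # baseline action
    y1 = x + 0.5 + rng.normal(size=x.shape[0])   # alternative, better on average
    return np.stack([y0, y1], axis=1)

def harm_rate(actions, outcomes, baseline=0):
    """Fraction of individuals whose chosen action is strictly worse than the baseline."""
    chosen = outcomes[np.arange(len(actions)), actions]
    return float(np.mean(chosen < outcomes[:, baseline]))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = potential_outcomes(x, rng)

always_alt = np.ones(len(x), dtype=int)
always_base = np.zeros(len(x), dtype=int)

mean_gain = float(y[:, 1].mean() - y[:, 0].mean())
print("average gain of the alternative:", mean_gain)
print("harm rate, always alternative:", harm_rate(always_alt, y))
print("harm rate, always baseline:", harm_rate(always_base, y))
```

In this toy setup the alternative action wins on average, yet a sizeable share of individuals would have done strictly better under the baseline, which is the gap between average optimality and individual safety that the claim targets. The always-baseline policy has harm rate zero by construction, because harm requires a strictly worse outcome.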

What carries the argument

The two-stage procedure that first estimates counterfactual harm relative to a baseline and then optimizes the policy under a harm-rate constraint.
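The two-stage structure can be sketched as follows. This is an illustrative assumption about its shape, not the authors' algorithm: stage one here computes a harm probability under an assumed model of independent Gaussian potential outcomes, and stage two picks the action with the best harm-penalized value using a hypothetical `risk_aversion` weight (the figure captions mention a risk-aversion factor, but its exact role in the paper is not stated in this summary).

```python
import math
import numpy as np

# Standard normal CDF built from math.erf, applied elementwise.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def stage_one_harm(mu, sigma, baseline=0):
    """Stage 1 (assumed form): P(Y(a) < Y(baseline)) for each state and action.

    mu, sigma: float arrays of shape (n_states, n_actions) holding estimated outcome
    means and standard deviations. Independence across actions is an assumption
    made only for this sketch.
    """
    diff = mu[:, [baseline]] - mu                        # baseline mean minus action mean
    scale = np.sqrt(sigma**2 + sigma[:, [baseline]]**2)  # std of Y(a) - Y(baseline)
    harm = _norm_cdf(diff / scale)
    harm[:, baseline] = 0.0                              # the baseline cannot be strictly worse than itself
    return harm

def stage_two_policy(q_values, harm, risk_aversion):
    """Stage 2 (assumed form): greedy action on the harm-penalized value."""
    return np.argmax(q_values - risk_aversion * harm, axis=1)

mu = np.array([[0.0, 0.2], [0.0, 1.0], [0.0, -0.5]])
sigma = np.ones_like(mu)
harm = stage_one_harm(mu, sigma)
neutral = stage_two_policy(mu, harm, risk_aversion=0.0)
averse = stage_two_policy(mu, harm, risk_aversion=1.0)
print(neutral, averse)
```

In this example the risk-neutral policy takes the alternative action whenever its mean is higher, while the risk-averse policy falls back to the baseline in the first state, where the mean advantage is small compared with the estimated chance of harm.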

If this is right

The learned policy achieves high expected return with the harm rate remaining well-controlled.
Finite-sample guarantees are established for the learned policy.
An explicit upper bound on the sub-optimality gap is available.
The procedure demonstrates effectiveness on both simulated and real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to sequential decision settings outside standard RL where individual-level safety matters.
Estimation of counterfactual baselines could be strengthened by combining with observational causal methods.
High-stakes applications such as medical treatment sequences offer natural test beds for the harm control.
Relaxing the baseline requirement to purely observational data would broaden applicability.

Load-bearing premise

That counterfactual outcomes relative to a baseline alternative can be meaningfully defined, estimated, or bounded in the given RL environment so that harm events are identifiable and controllable.

What would settle it

An experiment applying the two-stage procedure yet finding the realized harm rate above the target control level would falsify the claim that harm remains well-controlled.
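Such a check could be sketched as below. The sketch assumes a simulator in which harm indicators are observable and independent across trajectories, and it uses a Hoeffding interval, which is a choice made here rather than something taken from the paper; `harm_rate_interval` and `exceeds_target` are hypothetical names.

```python
import math

def harm_rate_interval(harm_indicators, delta=0.05):
    """Empirical harm rate with one-sided Hoeffding bounds, each valid with probability 1 - delta."""
    n = len(harm_indicators)
    rate = sum(harm_indicators) / n
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return rate - margin, rate, rate + margin

def exceeds_target(harm_indicators, target, delta=0.05):
    """Flag a violation only when even the lower bound lies above the target level."""
    lower, _, _ = harm_rate_interval(harm_indicators, delta)
    return lower > target

# Hypothetical evaluation run: 300 harmed trajectories out of 1000.
indicators = [1] * 300 + [0] * 700
lower, rate, upper = harm_rate_interval(indicators)
print(lower, rate, upper)
print(exceeds_target(indicators, target=0.10))
```

Requiring the lower bound to exceed the target keeps sampling noise from being read as a falsification; a realized rate slightly above the target on a small sample would not count.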

Figures

Figures reproduced from arXiv: 2605.25114 by Chengchun Shi, Jingyi Li, Peng Wu.

**Figure 2.** Comparisons of outcome versus harm under (a) linear setting and with risk-aversion

**Figure 3.** Comparisons of outcome and harm versus sample size

**Figure 4.** Comparison of outcome and harm across different values of risk-aversion factor

Original abstract

Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To address this, we first formalize the notion of individual harm from a counterfactual perspective and define harm as the event in which a chosen action results in a strictly worse outcome than a baseline alternative. We then propose a general two-stage procedure for learning policies that maximize the expected return while accounting for individual harm. We further establish the finite-sample properties of the learned policy, derive an upper bound on its sub-optimality gap, and show that the harm rate remains well-controlled. Numerical experiments on both simulated and real-world datasets demonstrate the effectiveness of the proposed approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines individual harm counterfactually in RL and gives a two-stage procedure with finite-sample bounds on harm rate and sub-optimality, but the identifiability of those counterfactuals on trajectories looks like the load-bearing assumption that is not clearly resolved.

The main takeaway is that this work formalizes harm as the event where a chosen action produces a strictly worse outcome than a baseline alternative, then uses a two-stage procedure to maximize expected return while keeping the harm rate controlled, along with finite-sample properties and a sub-optimality bound.

What stands out is the focus on individual-level safety rather than just average performance, which is a real gap in many RL applications. The two-stage structure is a clean way to separate the harm control step from the return maximization, and attempting explicit finite-sample analysis is better than most safe RL papers that stop at asymptotic claims.

The soft spot is identifiability. RL outcomes are full trajectories shaped by the policy, transitions, and future actions, so any counterfactual comparison to a baseline requires either an oracle model or strong conditions on confounding and baseline policy. The abstract gives no sign these are stated or relaxed, which means the harm indicator and the derived bounds rest on assumptions that may not hold in standard environments. If the full paper does not supply concrete estimation methods or explicit conditions, the guarantees become hard to apply.
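As an illustration of what an explicit identifiability condition could look like, consider the case where, given $X_t = x$, the two potential outcomes are jointly Gaussian with correlation $\rho$. The paper's supplementary material mentions a Gaussian copula with parameter $\rho$; the Gaussian marginals and the symbols $\mu_a(x)$, $\sigma_a(x)$ are assumptions made here for the sketch. The conditional harm probability of action $a$ against baseline $a'$ then has a closed form:

```latex
\[
\Pr\{\,Y_t(a) < Y_t(a') \mid X_t = x\,\}
= \Phi\!\left(
\frac{\mu_{a'}(x) - \mu_a(x)}
     {\sqrt{\sigma_a^2(x) + \sigma_{a'}^2(x) - 2\rho\,\sigma_a(x)\,\sigma_{a'}(x)}}
\right),
\]
```

where $\Phi$ is the standard normal distribution function and the denominator is assumed positive. The marginal means and variances can in principle be estimated from data under no unmeasured confounding, but $\rho$ cannot, which is why a condition of this kind has to be stated rather than learned.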

This is aimed at the safe RL community, especially people working on fairness or individual safety in sequential decisions. A reader already familiar with counterfactual inference and RL theory will get the most out of it.

It deserves a serious referee to check whether the full derivations close the identifiability gap and whether the numerical experiments actually test the finite-sample claims under realistic confounding. I would send it to review rather than desk reject.

Referee Report

1 major / 0 minor

Summary. The paper formalizes individual harm in RL as a counterfactual event where the chosen action yields a strictly worse outcome than a baseline alternative. It proposes a two-stage procedure to maximize expected return while controlling the harm rate, claims to establish finite-sample properties of the learned policy, derives an upper bound on the sub-optimality gap, and validates the approach via experiments on simulated and real-world datasets.

Significance. If the counterfactual harm definition and associated bounds can be made rigorous under explicit identifiability conditions, the work would offer a useful framework for individual-level safety in RL beyond average-case optimization. The two-stage procedure and finite-sample guarantees would be notable strengths if supported by the derivations.

major comments (1)

[Abstract] The claims of finite-sample properties and an upper bound on the sub-optimality gap rest on the counterfactual harm indicator being well-defined and estimable. However, in an RL setting where outcomes are full trajectories depending on the policy and transition kernel, the abstract indicates no conditions for identifying individual counterfactual outcomes relative to the baseline (e.g., no unmeasured confounding, or a known baseline policy). Without such assumptions the harm-rate control and bounds are non-operational, and this is load-bearing for all central claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address the major comment below.

Referee: [Abstract] The claims of finite-sample properties and an upper bound on the sub-optimality gap rest on the counterfactual harm indicator being well-defined and estimable. However, in an RL setting where outcomes are full trajectories depending on the policy and transition kernel, the abstract indicates no conditions for identifying individual counterfactual outcomes relative to the baseline (e.g., no unmeasured confounding, or a known baseline policy). Without such assumptions the harm-rate control and bounds are non-operational, and this is load-bearing for all central claims.

Authors: We agree that the manuscript does not explicitly state identifiability conditions. Our formalization of counterfactual harm and the subsequent finite-sample analysis assume a known baseline policy together with the standard no-unmeasured-confounding condition that permits identification of individual counterfactual outcomes from observed trajectories. The two-stage procedure and the derived bounds on the sub-optimality gap and harm rate are valid conditional on these assumptions. We will revise the paper to add an explicit subsection stating these conditions (with references to the causal-RL literature) in the problem formulation, thereby making the operational scope of the claims transparent.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation builds on explicit formalization and standard RL bounds without reduction to fitted inputs or self-citations.

The abstract and provided excerpts show the paper first defines individual harm via a counterfactual comparison to a baseline, then introduces a two-stage procedure to maximize return subject to harm control, followed by finite-sample analysis and a sub-optimality bound. No equations or steps are quoted that equate a derived quantity (e.g., the bound or harm rate) to a fitted parameter or prior self-citation by construction. The reader's assessment of score 2.0 aligns with the absence of self-definitional, fitted-prediction, or load-bearing self-citation patterns; the central claims rest on the new harm formalization plus conventional RL theory rather than circular reduction. Identifiability concerns raised by the skeptic pertain to assumption validity, not derivation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; ledger is therefore minimal and provisional.

axioms (1)

domain assumption: Counterfactual outcomes for baseline alternatives can be defined and compared in the RL setting to identify harm events.
Required to make the harm definition operational.

Reference graph

Works this paper leans on

21 extracted references · 10 canonical work pages · 3 internal anchors

[1]

doi: 10.1214/009053606000001217

ISSN 0090-5364. doi: 10.1214/009053606000001217. URL http://dx.doi.org/10.1214/009053606000001217. Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. Theoretical Computer Science, 410(19):1876–1902,

work page doi:10.1214/009053606000001217 1902
[2]

Marie-Pierre de Béthune. Non-nucleoside reverse transcriptase inhibitors (NNRTIs), their discovery, development, and use in the treatment of HIV-1 infection: a review of the last 20 years (1989–2009). Antiviral Research, 85(1):75–90,

1989
[3]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR,

2052
[4]

Antiretroviral drugs for treatment and prevention of HIV infection in adults: 2016 recommendations of the International Antiviral Society–USA panel

Huldrych F Günthard, Michael S Saag, Constance A Benson, Carlos Del Rio, Joseph J Eron, Joel E Gallant, Jennifer F Hoy, Michael J Mugavero, Paul E Sax, Melanie A Thompson, et al. Antiretroviral drugs for treatment and prevention of HIV infection in adults: 2016 recommendations of the International Antiviral Society–USA panel. JAMA, 316(2):191–210,

2016
[5]

Deep Reinforcement Learning: An Overview

Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Fairness-aware contextual dynamic pricing with strategic buyers

Pangpang Liu and Will Wei Sun. Fairness-aware contextual dynamic pricing with strategic buyers. arXiv preprint arXiv:2501.15338,

work page arXiv
[7]

Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999,

1999
[8]

Combining kernel and model based learning for HIV therapy selection

Sonali Parbhoo, Jasmina Bogojeska, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Combining kernel and model based learning for HIV therapy selection. AMIA Summits on Translational Science Proceedings, 2017:239,

2017
[9]

Statistically efficient advantage learning for offline reinforcement learning in infinite horizons

Chengchun Shi, Shikai Luo, Yuan Le, Hongtu Zhu, and Rui Song. Statistically efficient advantage learning for offline reinforcement learning in infinite horizons. Journal of the American Statistical Association, 119(545):232–245, 2024a. Chengchun Shi, Zhengling Qi, Jianing Wang, and Fan Zhou. Value enhancement of reinforcement learning via efficient and rob...

2011
[10]

Worst-case aware policy optimization for robust reinforcement learning

Ziyu Tang, Qinqing Zhang, and Wen Sun. Worst-case aware policy optimization for robust reinforcement learning. arXiv preprint arXiv:2002.08033,

work page arXiv 2002
[11]

A review of off-policy evaluation in reinforcement learning

Masatoshi Uehara, Chengchun Shi, and Nathan Kallus. A review of off-policy evaluation in reinforcement learning. arXiv preprint arXiv:2212.06355,

work page arXiv
[12]

Counterfactually fair reinforcement learning via sequential data preprocessing

Jitao Wang, Chengchun Shi, John D Piette, Joshua R Loftus, Donglin Zeng, and Zhenke Wu. Counterfactually fair reinforcement learning via sequential data preprocessing. arXiv preprint arXiv:2501.06366,

work page arXiv
[13]

What are the statistical limits of offline RL with linear function approximation?

Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895,

work page arXiv 2010
[14]

The Promises of Multiple Experiments: Identifying Joint Distribution of Potential Outcomes

Peng Wu and Xiaojie Mao. The promises of multiple experiments: Identifying joint distribution of potential outcomes. arXiv preprint arXiv:2504.20470,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Quantifying Individual Risk for Binary Outcomes

Peng Wu, Peng Ding, Zhi Geng, and Yue Liu. Quantifying individual risk for binary outcome: Bounds and inference. arXiv preprint arXiv:2402.10537,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Mitigating unwanted biases with adversarial learning

Brian Hu Zhang, Bethany Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340,

2018
[17]

Ročková, Veronika V

doi: 10.1080/01621459.2022.2138760. Supplementary Material S1. Harm Rate Identifiability under a Gaussian Copula. In this section, we derive the identifiability formula for the harm rate presented at the end of Section 2.3 of the manuscript. Under the Gaussian copula assumption with parameter $\rho$, the joint distribution of $(Y_t(a), Y_t(a'))$ given $X_t = x$ is a ...

work page doi:10.1080/01621459.2022.2138760 2022
[18]

1997; Chernozhukov and Hansen 2005)

Since the noise term $\epsilon_t$ is shared across different actions, it is not hard to verify that the joint distribution of $(Y(a), Y(a'))$ conditional on $X_t$ is Gaussian with correlation coefficient $\rho = 1$, a property commonly referred to as rank preservation (Heckman et al. 1997; Chernozhukov and Hansen 2005). Table S1: Performance comparison across different sam...

1997
[19]

Table S1 and S2 report the detailed numerical values for discounted rewards and average harms over 100 replications, under the linear and nonlinear settings, respectively

Numerical details. To complement the experimental results in the main text, we provide a comprehensive breakdown of the performance of our algorithm across different dataset sizes, with the values of $N \in \{100, 500, 1000, 2000\}$. Tables S1 and S2 report the detailed numerical values for discounted rewards and average harms over 100 replications, under the linear an...

2000
[20]

Lemma S1. Under the completeness assumption 4(ii) and feature convergence assumption 4(iii) in the manuscript, and assuming $|\hat{r}_t| \le M$, we have $\|Q^* - \hat{Q}_K\|_\infty \le \sum_{t=0}^{K-1} \gamma^t \|\hat{Q}_{K-t} - \mathcal{T}\hat{Q}_{K-t-1}\|_\infty + \frac{\gamma^K M}{1-\gamma}$. (S6) Proof. This can be shown similarly as Theorem 8 in Hu et al. (2025). Lemma S2. Under the completeness assumption 4(ii) and feature convergence assumption 4(...

2025
[21]

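Denote $\hat{w}_Q$ as the OLS estimator for any $Q \in \mathcal{Q}$

For ease of presentation, we use lowercase letters from now on. Denote $\hat{w}_Q$ as the OLS estimator for any $Q \in \mathcal{Q}$. Let $\hat{\Sigma} = \sum_{i=1}^{n} \phi(x_i, a_i)\phi(x_i, a_i)^\top$ be the empirical design matrix; then $\hat{w}_Q = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^{n} \bigl(w^\top \phi(x_i, a_i) - r_i - \gamma \max_{a' \in \mathcal{A}} Q(x'_i, a')\bigr)^2 = \hat{\Sigma}^{-1} \sum_{i=1}^{n} \phi(x_i, a_i)\bigl(r_i + \gamma \max_{a' \in \mathcal{A}} Q(x'_i, a')\bigr)$. And let $\tilde{w}_Q$ be the parameter using the esti...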

1994
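
The least-squares step excerpted in entry [21] amounts to regressing the Bellman target $r_i + \gamma \max_{a'} Q(x'_i, a')$ on the features $\phi(x_i, a_i)$. A minimal sketch of one such update follows, assuming a linear Q-function and an invertible design matrix; the function name `ols_q_update` and the array layout are choices made here, not taken from the paper.

```python
import numpy as np

def ols_q_update(phi, rewards, next_phi, w_prev, gamma):
    """One least-squares Bellman update for a linear Q-function.

    phi:      (n, d) features phi(x_i, a_i) of the observed state-action pairs
    rewards:  (n,) observed rewards r_i
    next_phi: (n, n_actions, d) features phi(x'_i, a') for every candidate action a'
    w_prev:   (d,) weights of the current Q, so that Q(x, a) = w_prev @ phi(x, a)
    """
    next_q = next_phi @ w_prev                       # (n, n_actions) values at next states
    targets = rewards + gamma * next_q.max(axis=1)   # r_i + gamma * max_a' Q(x'_i, a')
    sigma_hat = phi.T @ phi                          # empirical design matrix
    return np.linalg.solve(sigma_hat, phi.T @ targets)

rng = np.random.default_rng(0)
phi = rng.normal(size=(50, 3))
next_phi = rng.normal(size=(50, 2, 3))
w_true = np.array([1.0, -2.0, 0.5])

# Starting from Q = 0, the target reduces to the reward, so the update recovers w_true.
w_new = ols_q_update(phi, phi @ w_true, next_phi, np.zeros(3), gamma=0.9)
print(w_new)
```

Iterating this update, with `w_new` fed back in as `w_prev`, gives a basic fitted Q-iteration loop; solving the normal equations with `np.linalg.solve` avoids forming the inverse of the design matrix explicitly.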
