Wasserstein Policy Learning for Distributional Outcomes
Pith reviewed 2026-06-26 19:40 UTC · model grok-4.3
The pith
Offline policy learning extends to distribution-valued outcomes by optimizing utilities on Wasserstein barycenters, with regret bounds driven by policy class complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the one-dimensional Wasserstein setting and under the stated regularity conditions, the finite-sample regret for the policy learning framework based on both IPW and DR estimators has leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$. The leading regret rate remains governed by the policy-class complexity even though the quantile domain is infinite-dimensional. A minimax lower bound establishes the sharpness of the leading dependence on $N$ and $\mathrm{N\text{-}dim}(\Pi)$.
What carries the argument
Utility functional applied to the Wasserstein barycenter of the outcome distributions induced by a policy, estimated via IPW and DR estimators.
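A minimal sketch of how such an estimate could be computed, assuming each observed outcome distribution is stored as a quantile function on a fixed grid and that propensity scores are known. It relies on the standard fact that, in one dimension, the 2-Wasserstein barycenter has a quantile function equal to the weighted average of the individual quantile functions. The names `ipw_barycenter_quantile` and `mean_utility`, the toy data, and the choice of the mean as utility are all hypothetical and not taken from the paper.

```python
import numpy as np

def ipw_barycenter_quantile(Q, actions, propensities, policy_actions):
    """IPW estimate of the barycenter quantile function under a candidate policy.

    Q              : (N, G) array, row i is unit i's outcome quantile function on G grid levels
    actions        : (N,) observed treatments
    propensities   : (N,) probability of the observed treatment given covariates (assumed known)
    policy_actions : (N,) treatment the candidate policy would assign to each unit
    """
    match = (actions == policy_actions).astype(float)
    weights = match / propensities          # inverse probability weights
    return (weights[:, None] * Q).mean(axis=0)

def mean_utility(q_values):
    # Hypothetical utility: the mean of the distribution, i.e. the integral of
    # the quantile function over (0, 1), approximated by the grid average.
    return float(np.mean(q_values))

# Toy data (illustrative only): unit i has outcome Uniform(loc_i, loc_i + 1),
# so its quantile function is Q_i(u) = loc_i + u.
rng = np.random.default_rng(0)
N, G = 500, 50
u = (np.arange(G) + 0.5) / G                # midpoint grid on (0, 1)
X = rng.normal(size=N)
actions = rng.integers(0, 2, size=N)
propensities = np.full(N, 0.5)              # randomized assignment
Q = (X * actions)[:, None] + u[None, :]

policy_actions = (X > 0).astype(int)        # candidate policy: treat when X > 0
value = mean_utility(
    ipw_barycenter_quantile(Q, actions, propensities, policy_actions)
)
```

Because the IPW weights are nonnegative, the averaged curve stays nondecreasing in the quantile level, so it remains a valid quantile function up to scale.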
If this is right
- The regret rate depends only on the complexity of the policy class and not on the infinite dimensionality of the outcome distributions.
- Both IPW and DR estimators achieve the same leading regret rate.
- The minimax lower bound confirms that no estimator can improve on the square-root dependence on policy complexity and sample size.
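For comparison with IPW, the DR estimator can be sketched as the usual augmented IPW form applied pointwise on the quantile grid. This is an illustration under assumptions, not the paper's construction: `dr_barycenter_quantile` and the outcome-model array `m_hat` are hypothetical names, and the fitted quantile curves are assumed to be supplied by some external regression.

```python
import numpy as np

def dr_barycenter_quantile(Q, actions, propensities, policy_actions, m_hat):
    """Doubly robust estimate of the barycenter quantile function under a policy.

    Q              : (N, G) observed outcome quantile functions on G grid levels
    actions        : (N,) observed treatments, integers in {0, ..., K-1}
    propensities   : (N,) probability of the observed treatment given covariates
    policy_actions : (N,) treatments assigned by the candidate policy (integers)
    m_hat          : (N, K, G) fitted quantile curve for unit i under treatment a
                     (hypothetical outcome model, assumed fitted elsewhere)
    """
    idx = np.arange(Q.shape[0])
    direct = m_hat[idx, policy_actions]      # model prediction under the policy
    residual = Q - m_hat[idx, actions]       # observed minus fitted at observed arm
    weights = (actions == policy_actions) / propensities
    return (direct + weights[:, None] * residual).mean(axis=0)

# Toy check (illustrative only): with an outcome model that is identically zero,
# the DR estimate reduces to the IPW estimate.
Q_demo = np.array([[0.0, 1.0], [2.0, 3.0]])
a_demo = np.array([0, 1])
e_demo = np.array([0.5, 0.5])
pi_demo = np.array([1, 1])
dr_zero_model = dr_barycenter_quantile(
    Q_demo, a_demo, e_demo, pi_demo, np.zeros((2, 2, 2))
)
```

Unlike the IPW average, the correction term can carry negative contributions, so the resulting curve is not guaranteed to be monotone in finite samples; whether and how the paper handles this is not stated in the text above.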
Where Pith is reading between the lines
- The same uniform-deviation technique could be applied to other distances between distributions provided analogous regularity conditions can be verified.
- The framework suggests that individualized treatment rules could be learned directly from histogram or density data without first reducing each outcome to a scalar summary.
- Empirical tests on real distributional data would reveal how large N must be before the square-root rate becomes visible.
Load-bearing premise
The regularity conditions, which the abstract invokes but does not specify, hold and suffice to control the uniform deviation over the product of the combinatorial policy class and the infinite-dimensional quantile domain.
What would settle it
A concrete data-generating process and policy class satisfying the regularity conditions for which the observed regret exceeds $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$ by more than logarithmic factors.
Original abstract
Offline policy learning has received growing attention in causal inference. The primary objective is to learn a policy (individualized treatment rule) as a mapping from covariates to treatment that maximizes the empirical welfare defined as the mean of scalar-valued potential outcomes. In this paper, we study offline policy learning with distribution-valued outcomes, where each potential outcome is a probability measure on $\mathbb{R}$ and the reward is defined through a utility functional applied to the Wasserstein barycenter of induced outcome distributions. We establish statistical guarantees for the policy learning framework based on both Inverse Probability Weighting (IPW) and Doubly Robust (DR) estimators. By handling the challenging uniform deviation over the product of the combinatorial policy class and the infinite-dimensional quantile domain, we prove that the finite-sample regret has leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$. In the one-dimensional Wasserstein setting and under the stated regularity conditions, the leading regret rate is still governed by the policy-class complexity. Moreover, we provide a minimax lower bound establishing the sharpness of the leading dependence on $N$ and $\mathrm{N\text{-}dim}(\Pi)$.
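Two standard identities help explain why the abstract speaks of a quantile domain. They are textbook facts about the 2-Wasserstein distance between measures on $\mathbb{R}$ with finite second moments. The symbols $Q_\mu$, $w_k$, $U$, $\mu_\pi$, $V$, and $R$ are notation assumed here for illustration; the paper may use a different Wasserstein order or normalization.

```latex
% Assumed notation: Q_\mu is the quantile function of \mu, U the utility functional,
% \mu_\pi the barycenter of outcome distributions induced by policy \pi.
\begin{align*}
  W_2^2(\mu,\nu) &= \int_0^1 \bigl(Q_\mu(u) - Q_\nu(u)\bigr)^2 \, du, \\
  Q_{\bar\mu}(u) &= \sum_{k=1}^{K} w_k \, Q_{\mu_k}(u), \qquad u \in (0,1),
    \quad w_k \ge 0, \ \sum_{k=1}^{K} w_k = 1, \\
  V(\pi) &= U(\mu_\pi), \qquad
  R(\hat\pi) = \sup_{\pi \in \Pi} V(\pi) - V(\hat\pi).
\end{align*}
```

The second line says that the barycenter $\bar\mu$ of $\mu_1,\dots,\mu_K$ is obtained by averaging quantile functions, which is why estimating the barycenter reduces to estimating a function on $(0,1)$.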
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops an offline policy learning framework for distribution-valued outcomes, where the objective is to maximize a utility functional of the Wasserstein barycenter of induced distributions. It introduces IPW and DR estimators for this setting and claims finite-sample regret bounds of leading order $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$ together with a matching minimax lower bound, with the rate governed by policy-class complexity under regularity conditions that control uniform deviations over the product of the policy class and the quantile domain.
Significance. If the claimed bounds hold under explicitly verifiable conditions, the work meaningfully extends scalar-outcome policy learning to distributional outcomes while preserving sharp dependence on policy complexity rather than outcome dimension. The matching upper and lower bounds constitute a clear strength.
major comments (2)
- [Abstract] Abstract and the paragraph on statistical guarantees: the regret bound $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$ and its minimax sharpness are asserted to follow from controlling the uniform deviation over the product of the combinatorial policy class $\Pi$ and the infinite-dimensional quantile domain, yet the required regularity conditions (entropy integrability, Lipschitz constants on the utility, moment bounds on outcome measures, etc.) are referenced but never explicitly enumerated or shown to be sufficient for the chaining argument to close.
- [Statistical guarantees section] Section deriving the finite-sample regret (IPW/DR estimators): the leading term is obtained by applying standard IPW/DR theory to the new functional, but without explicit error-bar details, the precise statement of the regularity conditions, or verification that they hold uniformly over the product space, the claimed rate cannot be confirmed.
minor comments (1)
- [Notation] Clarify the precise definition of N-dim(\Pi) (e.g., whether it is the Natarajan dimension) and ensure consistent notation between the abstract and the body.
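If N-dim does denote the Natarajan dimension (an assumption; the text above does not confirm it), the definition can be made concrete with a brute-force check on a small finite policy class. A set of points is N-shattered when two labelings that disagree at every point exist such that every way of mixing them is realized by some policy. The function names below are hypothetical and the search is exponential, so this is only a sketch for toy classes.

```python
from itertools import combinations, product

def is_n_shattered(points, hypotheses):
    """Check Natarajan shattering of `points` by a finite class.

    hypotheses: list of dicts mapping each domain point to a label.
    """
    patterns = {tuple(h[x] for x in points) for h in hypotheses}
    labels = [sorted({p[j] for p in patterns}) for j in range(len(points))]
    # For each point, candidate pairs (f1(x), f2(x)) with f1(x) != f2(x).
    pair_choices = [[(a, b) for a in labs for b in labs if a != b] for labs in labels]
    for pairs in product(*pair_choices):
        # Every mix of f1 and f2 across the points must be realized by the class.
        if all(
            tuple(pairs[j][bit] for j, bit in enumerate(bits)) in patterns
            for bits in product((0, 1), repeat=len(points))
        ):
            return True
    return False

def natarajan_dimension(domain, hypotheses):
    """Largest size of an N-shattered subset of `domain` (brute force)."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_n_shattered(s, hypotheses) for s in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best

# Toy class: three constant policies over three covariate values.
domain = [0, 1, 2]
constant_policies = [{x: a for x in domain} for a in range(3)]
dim_constant = natarajan_dimension(domain, constant_policies)
```

With two actions the definition coincides with the VC dimension, which is one reason the referee's request for an explicit definition matters for comparing with binary-treatment results.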
Simulated Authors' Rebuttal
We thank the referee for the constructive comments, which correctly identify areas where greater explicitness will strengthen the manuscript. We will revise to enumerate the regularity conditions and provide the missing derivation details while preserving the core results on the regret bounds.
- Referee: [Abstract] Abstract and the paragraph on statistical guarantees: the regret bound $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$ and its minimax sharpness are asserted to follow from controlling the uniform deviation over the product of the combinatorial policy class $\Pi$ and the infinite-dimensional quantile domain, yet the required regularity conditions (entropy integrability, Lipschitz constants on the utility, moment bounds on outcome measures, etc.) are referenced but never explicitly enumerated or shown to be sufficient for the chaining argument to close.
Authors: We agree that the conditions are referenced rather than enumerated. In the revision we will add a dedicated subsection listing them explicitly: (i) finite entropy integral of \Pi with respect to the covering metric on the quantile domain, (ii) Lipschitz continuity of the utility functional w.r.t. the 1-Wasserstein distance with constant independent of the policy, (iii) uniform fourth-moment bounds on the outcome measures, and (iv) overlap and boundedness conditions on the propensity scores. We will then include a short chaining argument showing that these conditions suffice to control the supremum over \Pi \times [0,1] and thereby close the proof of the stated regret rate. revision: yes
- Referee: [Statistical guarantees section] Section deriving the finite-sample regret (IPW/DR estimators): the leading term is obtained by applying standard IPW/DR theory to the new functional, but without explicit error-bar details, the precise statement of the regularity conditions, or verification that they hold uniformly over the product space, the claimed rate cannot be confirmed.
Authors: The observation is accurate. The revised section will contain (a) an explicit error decomposition separating the IPW/DR bias term from the stochastic term with explicit constants, (b) a uniform deviation lemma that states the bound under the enumerated conditions, and (c) a verification paragraph confirming that the moment and Lipschitz assumptions propagate uniformly over the product space because the quantile functions remain controlled. These additions will make the derivation of the leading $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$ term directly verifiable without changing the stated results. revision: yes
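The role of the uniform deviation in both responses can be illustrated with the standard argument for empirical maximizers. This is a generic sketch, not the paper's proof: $\hat V$, $\hat\pi$, $\pi^*$, $\hat Q_\pi$, and the Lipschitz constant $L$ are assumed notation, and the second display additionally assumes that $\hat Q_\pi$ is a valid quantile function and that $U$ is $L$-Lipschitz in the 1-Wasserstein distance.

```latex
% Assumed notation: \hat V is the estimated welfare, \hat\pi maximizes \hat V over \Pi,
% \pi^* maximizes the true welfare V over \Pi.
\begin{align*}
  V(\pi^*) - V(\hat\pi)
  &= \bigl[V(\pi^*) - \hat V(\pi^*)\bigr]
   + \underbrace{\bigl[\hat V(\pi^*) - \hat V(\hat\pi)\bigr]}_{\le\, 0}
   + \bigl[\hat V(\hat\pi) - V(\hat\pi)\bigr] \\
  &\le 2 \sup_{\pi \in \Pi} \bigl|\hat V(\pi) - V(\pi)\bigr|, \\[4pt]
  \bigl|U(\hat\mu_\pi) - U(\mu_\pi)\bigr|
  &\le L \, W_1(\hat\mu_\pi, \mu_\pi)
   = L \int_0^1 \bigl|\hat Q_\pi(u) - Q_\pi(u)\bigr| \, du
   \le L \sup_{u \in (0,1)} \bigl|\hat Q_\pi(u) - Q_\pi(u)\bigr|.
\end{align*}
```

Combining the two lines bounds the regret by a supremum over policies and quantile levels jointly, which is the product-space deviation the referee asks the authors to control explicitly.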
Circularity Check
No significant circularity; standard IPW/DR theory applied to new functional
The paper derives finite-sample regret bounds of order $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$ and a matching minimax lower bound by applying existing IPW and DR estimation theory to the Wasserstein-barycenter utility functional. The leading term is governed by the combinatorial complexity of the policy class $\Pi$, not by any fitted parameter, self-referential normalization, or self-citation chain. The abstract explicitly invokes 'stated regularity conditions' for the uniform deviation argument over the policy-quantile product space, but these are external to the derivation itself and do not reduce the claimed result to a tautology. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Regularity conditions on outcome distributions and utility functional that enable uniform deviation bounds over the product of the policy class and the quantile domain