Scalable Bi-causal Optimal Transport via KL Relaxation and Policy Gradients

arxiv: 2605.17271 · v1 · pith:RPVPFU2Jnew · submitted 2026-05-17 · 🧮 math.OC · cs.LG

Scalable Bi-causal Optimal Transport via KL Relaxation and Policy Gradients

Haoyang Cao , Jesse Hoekstra , Renyuan Xu , Yumin Xu , Ruixun Zhang This is my paper

Pith reviewed 2026-05-19 23:06 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords bi-causal optimal transportKL relaxationpolicy gradientsdynamic programmingstochastic optimizationrobust financetime series downscaling

0 comments p. Extension

pith:RPVPFU2J Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{RPVPFU2J}

Prints a linked pith:RPVPFU2J badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

KL relaxation turns bi-causal optimal transport into a policy-gradient problem

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a scalable stochastic optimization framework for bi-causal optimal transport by introducing a Kullback-Leibler penalized relaxation. This replaces hard marginal constraints with divergence penalties while preserving the recursive structure needed for dynamic programming. The authors prove that the relaxed problem converges to the original bi-causal OT as the penalty grows and derive explicit policy-gradient representations for the relaxed objective. They then propose a practical algorithm equipped with unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees. The approach matters because it makes nonanticipative couplings feasible for long-horizon problems in robust finance and sequential uncertainty quantification.

Core claim

We establish dynamic programming principles for both the original and relaxed formulations, prove that the relaxed problem converges to the original bi-causal OT problem as the penalty grows, and derive explicit policy-gradient representations for the relaxed objective. Building on these results, we propose a practical policy-gradient algorithm with unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees.

What carries the argument

The KL-penalized relaxation of the bi-causal optimal transport problem, which replaces hard marginal constraints with tractable divergence penalties while preserving recursive structure for dynamic programming and policy gradients.

If this is right

Dynamic programming principles hold for the relaxed formulation.
The relaxed problem converges to the original bi-causal OT as the penalty parameter grows.
Explicit policy-gradient representations exist for the relaxed objective.
The algorithm supplies unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could support generation of joint sample paths for simulation tasks that must respect both prescribed marginals and information flow over long horizons.
Similar KL relaxations might extend to other sequential optimal transport problems that inherit a recursive structure.
Integration with existing reinforcement learning optimizers could further reduce variance in high-dimensional path spaces.

Load-bearing premise

The KL relaxation preserves the recursive structure of the bi-causal problem sufficiently for dynamic programming and policy gradient methods to apply directly.

What would settle it

A calculation showing that the optimal value or couplings of the relaxed problem fail to approach those of the original bi-causal OT as the KL penalty tends to infinity would falsify the convergence claim.

Figures

Figures reproduced from arXiv: 2605.17271 by Haoyang Cao, Jesse Hoekstra, Renyuan Xu, Ruixun Zhang, Yumin Xu.

**Figure 2.** Figure 2: Marginal distributions across different timesteps for the bimodal setting ( [PITH_FULL_IMAGE:figures/full_fig_p030_2.png] view at source ↗

**Figure 3.** Figure 3: Temporal correlations between adjacent time steps ( [PITH_FULL_IMAGE:figures/full_fig_p031_3.png] view at source ↗

**Figure 4.** Figure 4: Marginal distributions across time for the martingale setting ( [PITH_FULL_IMAGE:figures/full_fig_p034_4.png] view at source ↗

**Figure 5.** Figure 5: Adjacent-step correlations over n = 1, . . . , N (d = 1, N = 5). the generated and ground-truth adjacent-step correlations differ for n ∈ [N]. Overall, the method preserves the martingale dynamics while maintaining a close fit to the ground truth. Finally, it is worth noting that our implementation of Algorithm 1 runs faster than the method reported by Bayraktar and Han (2025b). 9 [PITH_FULL_IMAGE:figures… view at source ↗

**Figure 6.** Figure 6: Relationship between relative error (log scale) and [PITH_FULL_IMAGE:figures/full_fig_p035_6.png] view at source ↗

**Figure 7.** Figure 7: Marginal distributions across time for the downscaling setting ( [PITH_FULL_IMAGE:figures/full_fig_p039_7.png] view at source ↗

**Figure 8.** Figure 8: Adjacent-step correlations over n = 1, 10, 20, . . . , Nx (d = 2, Nx = 299). 7 Discussion This paper develops a scalable computational framework for bi-causal OT under general marginals. By combining a KL-penalized relaxation with the recursive structure of bi-causal OT, we obtain an end-to-end stochastic-optimization method that approximates the original path-space problem 36 [PITH_FULL_IMAGE:figures/ful… view at source ↗

read the original abstract

Bi-causal optimal transport (OT) is a natural framework for comparing and coupling stochastic processes under nonanticipative information constraints, with important applications in robust finance, sequential uncertainty quantification, and multistage stochastic optimization. In particular, a learned bi-causal coupling naturally serves as a simulator for generating joint sample paths that respect both prescribed marginal laws and the underlying information flow. Its practical use, however, is limited by the computational difficulty of enforcing bi-causal coupling constraints over path space, especially for continuous distributions and long horizons. We develop a scalable stochastic-optimization framework for computing bi-causal OT couplings under general marginals. Our approach introduces a Kullback--Leibler (KL)-penalized relaxation that replaces hard marginal constraints with tractable divergence penalties while preserving the recursive structure of the problem. We establish dynamic programming principles for both the original and relaxed formulations, prove that the relaxed problem converges to the original bi-causal OT problem as the penalty grows, and derive explicit policy-gradient representations for the relaxed objective. Building on these results, we propose a practical policy-gradient algorithm with unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees. Numerical experiments show that the method accurately captures marginal laws and temporal dependence, and performs well in applications including robust subhedging and time series statistical downscaling. These results provide a scalable computational approach to bi-causal OT and broaden its applicability in settings where nonanticipative information constraints are essential.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a KL-relaxed bi-causal OT formulation that supports dynamic programming and a policy-gradient algorithm with regret bounds, plus some finance numerics.

read the letter

The core contribution is a KL-penalized relaxation of bi-causal OT that keeps the recursive structure intact enough for dynamic programming and explicit policy-gradient formulas. They prove the relaxed problem converges to the hard-constrained version as the penalty increases, derive the gradients, and build a mini-batch algorithm with variance reduction and nonasymptotic regret bounds. The experiments on robust subhedging and time-series downscaling show it matches marginals and dependence reasonably well under the tested conditions. That combination of relaxation, DP, and stochastic optimization is not a direct copy of earlier OT work, so the algorithmic piece is genuinely new. The theoretical claims rest on standard dynamic programming and stochastic approximation arguments once the relaxation is in place, which is fine as long as the decomposition holds. The main soft spot is exactly the one the stress-test note flags: a global path-space KL term can create non-local dependencies that prevent the value function from depending only on the current state and conditional marginals. If the paper only assumes the decomposition without an explicit check for general filtrations and long horizons, the DP principle and the regret bounds could be narrower than stated. The numerics are limited to a few examples, so they do not yet rule out scaling issues when the marginals are far from the reference measure. This work is aimed at people who already care about bi-causal couplings in robust finance or multistage stochastic optimization and need a practical way to compute them. A reader looking for scalable simulation tools under information constraints would get concrete value from the algorithm and the reported runs. The paper has enough formal structure and empirical support to deserve a serious referee rather than a desk reject, though the proofs will need careful verification on the decomposition step. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a scalable stochastic optimization framework for bi-causal optimal transport by replacing hard marginal constraints with a KL-penalized relaxation. It claims to establish dynamic programming principles for both the original and relaxed problems, prove convergence of the relaxed formulation to the original bi-causal OT as the penalty parameter tends to infinity, derive explicit policy-gradient representations for the relaxed objective, and introduce a practical algorithm using unbiased mini-batch estimators, variance reduction, and nonasymptotic regret guarantees. Numerical experiments illustrate accurate recovery of marginal laws and temporal dependence, with applications to robust subhedging and time-series downscaling.

Significance. If the central claims are rigorously established, the work offers a computationally tractable route to bi-causal couplings for continuous distributions and long horizons, which is valuable for robust finance, sequential uncertainty quantification, and multistage stochastic optimization. The combination of DP principles, convergence results, and regret bounds with a policy-gradient implementation would strengthen the practical utility of bi-causal OT beyond current limitations imposed by path-space constraints.

major comments (2)

[Dynamic programming section for the relaxed formulation] The central derivation of dynamic programming for the relaxed problem (around the statement of the Bellman equation for the KL-penalized objective) assumes that the path-space KL divergence decomposes recursively with respect to the underlying filtration. A global KL term generally couples future and past information in a non-local manner; an explicit decomposition showing that the value function at each step depends only on the current state and conditional marginals is required to close the recursion and justify the subsequent policy-gradient and regret results.
[Regret analysis and algorithm section] The nonasymptotic regret guarantees for the proposed policy-gradient algorithm (stated after the derivation of the gradient representations) depend on the unbiasedness of the mini-batch estimators and the variance reduction technique. The dependence of the regret bound on the time horizon T and the dimension of the state space should be made explicit, as this directly affects the claimed scalability for long-horizon problems.

minor comments (2)

[Notation and preliminaries] Notation for the policy and the conditional marginals could be standardized between the original and relaxed formulations to improve readability.
[Numerical experiments] The numerical experiments section would benefit from additional details on the choice of the KL penalty strength and its sensitivity in the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below and are prepared to revise the manuscript to incorporate the requested clarifications and strengthen the presentation.

read point-by-point responses

Referee: [Dynamic programming section for the relaxed formulation] The central derivation of dynamic programming for the relaxed problem (around the statement of the Bellman equation for the KL-penalized objective) assumes that the path-space KL divergence decomposes recursively with respect to the underlying filtration. A global KL term generally couples future and past information in a non-local manner; an explicit decomposition showing that the value function at each step depends only on the current state and conditional marginals is required to close the recursion and justify the subsequent policy-gradient and regret results.

Authors: We thank the referee for this observation. The KL relaxation in our formulation is constructed so that the divergence penalty is applied to the conditional distributions induced by the bi-causal policies with respect to the underlying filtration. This permits an exact recursive decomposition of the path-space KL term via the chain rule for relative entropy. Consequently, the value function at each time step depends only on the current state and the relevant conditional marginals, allowing the Bellman equation to close. We will add an explicit lemma (with full proof) detailing this decomposition immediately before the statement of the dynamic programming principle for the relaxed problem. revision: yes
Referee: [Regret analysis and algorithm section] The nonasymptotic regret guarantees for the proposed policy-gradient algorithm (stated after the derivation of the gradient representations) depend on the unbiasedness of the mini-batch estimators and the variance reduction technique. The dependence of the regret bound on the time horizon T and the dimension of the state space should be made explicit, as this directly affects the claimed scalability for long-horizon problems.

Authors: We agree that an explicit display of the dependence on the horizon T and state dimension is necessary for a transparent assessment of scalability. In the revised version we will restate the main regret theorem with all factors involving T and the dimension written out explicitly (the bound grows linearly in T and polynomially in dimension, modulated by the mini-batch size and variance-reduction parameters). We will also add a short remark discussing the implications for long-horizon regimes. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivations rest on standard DP and stochastic approximation

full rationale

The paper introduces a KL-penalized relaxation of bi-causal OT and claims to establish DP principles, convergence as penalty grows, and explicit policy-gradient representations. These steps are presented as following from the recursive structure preserved by the relaxation and from standard dynamic programming and policy gradient theory. No quoted equations reduce target quantities to fitted parameters defined within the paper, no load-bearing self-citations close the central claims, and no ansatz or uniqueness result is smuggled in via prior author work. The framework is therefore self-contained against external benchmarks for the purposes of this circularity check.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claims rest on the existence of dynamic programming principles for the relaxed problem and the ability to sample from general marginals; the penalty strength acts as a tunable parameter controlling approximation quality.

free parameters (1)

KL penalty strength
Controls the trade-off between tractability and fidelity to the original marginal constraints; its value must be chosen or increased to recover the hard-constrained solution.

axioms (1)

domain assumption Dynamic programming principle holds for both the original bi-causal OT and its KL relaxation
Invoked to obtain recursive structure and explicit policy-gradient representations.

pith-pipeline@v0.9.0 · 5806 in / 1226 out tokens · 49923 ms · 2026-05-19T23:06:42.062019+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We establish dynamic programming principles for both the original and relaxed formulations... derive explicit policy-gradient representations for the relaxed objective.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

KL-penalized relaxation that replaces hard marginal constraints with tractable divergence penalties while preserving the recursive structure

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Here, the first inequality follows fromJπ n+1 ≥V n+1 and the monotonicity of operatorQπ n+1, and the last inequality follows from (15)

For anyπ∈ΠΠΠbc, we haveQπ n+1 ∈Π n+1 for alln∈[N−1] 0, and J π n =c n +Q π n+1[J π n+1]≥c n +Q π n+1[Vn+1] = Γ Qπ n+1 n [Vn+1]≥Γ n[Vn+1]. Here, the first inequality follows fromJπ n+1 ≥V n+1 and the monotonicity of operatorQπ n+1, and the last inequality follows from (15). Then, by the definition ofVn in (13) as the infimum overπ∈ΠΠΠbc, we obtainVn ≥Γ n[Vn+1]

work page
[2]

NX n=0 cn(Yn, Y ′ n) # =E π

For anyϵ >0, there exist kernelsQ ϵ 1 ∈Π 1, . . . , Qϵ N ∈Π N such that for anyn∈[N−1] 0, Γn[Vn+1]≤c n +Q ϵ n+1[Vn+1]≤Γ n[Vn+1] +ϵ2 −(n+1) (A.1) Then, for anyπ0 ∈Π(µ 0, µ′ 0), we construct a Markov bi-causal coupling πϵ :=π π0,QQQϵ withQQQϵ = (Qϵ 1, . . . , Qϵ N).(A.2) 42 Next, we prove by backward induction that forn∈[N−1] 0, J πϵ n ≤Γ n[Vn+1] +ϵ N−1X k=...

work page
[3]

43 The definition ofWbc in (5) as the infimum overπ∈ΠΠΠbc yieldsW bc(µ, µ′)≥inf π0∈Π(µ0,µ′ 0) Eπ0[V0]

For anyπ∈ΠΠΠbc withπ 0 ∈Π(µ 0, µ′ 0),J π 0 ≥V 0 and thus Eπ J π 0 (Y0, Y ′ 0) ≥E π V0(Y0, Y ′ 0) =E π0 V0(Y0, Y ′ 0) ≥inf π0∈Π(µ0,µ′ 0) Eπ0 V0(Y0, Y ′ 0) . 43 The definition ofWbc in (5) as the infimum overπ∈ΠΠΠbc yieldsW bc(µ, µ′)≥inf π0∈Π(µ0,µ′ 0) Eπ0[V0]

work page
[4]

Notice that (A.3) holds for any initial coupling inΠ(µ0, µ′ 0)

For anyϵ >0, there existsγ ϵ 0 ∈Π(µ 0, µ′ 0)such that Eγϵ 0[V0]≤inf π0∈Π(µ0,µ′ 0) Eπ0[V0] +ϵ.(A.4) UsingQQQε defined in (A.2), we construct a Markov bi-causal couplingγϵ :=π γϵ 0,QQQε ∈ M bc. Notice that (A.3) holds for any initial coupling inΠ(µ0, µ′ 0). Therefore, Part I and (A.3) imply thatV 0 ≤J γϵ 0 < V0 +ϵ, and using (A.4), we have Eγϵ[J γϵ 0 ]≤E γϵ...

work page
[5]

To establish the second equivalent representation, notice thatγ ϵ ∈ M bc(µ, µ′)⊂Π ΠΠbc and Wbc(µ, µ′) = inf π0∈Π(µ0,µ′

Eπ0[V0]. To establish the second equivalent representation, notice thatγ ϵ ∈ M bc(µ, µ′)⊂Π ΠΠbc and Wbc(µ, µ′) = inf π0∈Π(µ0,µ′

work page
[6]

B.1 Proof of Theorem 1 Proof.In the following proof, we work on theP(Y × Y)with the weak convergence topology, and omit the “weak” thereafter

Eπ0[V0], we have Wbc(µ, µ′) = inf π∈ΠΠΠbc Eπ " NX n=0 cn(Yn, Y ′ n) # ≤inf π∈Mbc(µ,µ′) Eπ J π 0 (Y0, Y ′ 0) ≤E γϵ[J πϵ 0 (Y0, Y ′ 0)]≤W bc(µ, µ′) + 2ϵ.(A.5) Sinceϵ >0is arbitrary, all inequalities in (A.5) become equalities, and hence Wbc(µ, µ′) = inf π0∈Π(µ0,µ′ 0) Eπ0[V0(Y0, Y ′ 0)] = inf π∈Mbc(µ,µ′) Eπ " NX n=0 cn(Yn, Y ′ n) # .□ B Omitted proofs in Sec...

work page 2013
[7]

Here, the first inequality follows fromeJ π n+1 ≥ eVn+1 and the monotonicity of the operatorQπ n+1, and the last inequality follows from the definition ofeΓn in (25)

For anyπ∈Π rel, we considerQπ n+1 ∈ Urel for alln∈[N−1] 0, and eJ π n =c n +β h KL(1) n+1(µ, π) + KL(2) n+1(µ′, π) i +Q π n+1[eJ π n+1] ≥c n +β h KL(1) n+1(µ, π) + KL(2) n+1(µ′, π) i +Q π n+1[eVn+1] =eΓ Qπ n+1 n [eVn+1]≥ eΓn[eVn+1]. Here, the first inequality follows fromeJ π n+1 ≥ eVn+1 and the monotonicity of the operatorQπ n+1, and the last inequality ...

work page
[8]

For anyϵ >0, there exist kernels eQϵ 1, . . . ,eQϵ N ∈ Urel such that for anyn∈[N−1] 0, eΓn[eVn+1]≤c n+β h KL(1) n+1(µ,eπϵ)+KL(2) n+1(µ′,eπϵ) i +eQϵ n+1[eVn+1]≤ eΓn[eVn+1]+ϵ2 −(n+1).(B.10) Then, for anyeπ0 ∈ P(S×S), we construct a relaxed Markov policy eπϵ :=π eπ0, eQeQeQϵ with eQeQeQϵ = (eQϵ 1, . . . ,eQϵ N).(B.11) 48 Next, we prove by backward induction...

work page
[9]

Therefore, (17) is equivalent to Wrel(µ, µ′) = inf π∈Πrel Eπ0 h eJ π 0 (Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . Therefore, (17) is equivalent to Wrel(µ, µ′) = inf π∈Πrel Eπ0 h eJ π 0 (Y0, Y ′

work page
[10]

To show the equivalent representation forWrel(µ, µ′), we combine the following two results

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . To show the equivalent representation forWrel(µ, µ′), we combine the following two results

work page
[11]

For anyπ∈Π rel, we haveeJ π 0 ≥ eV0, and thus Eπ0 h eJ π 0 (Y0, Y ′

work page
[12]

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i ≥E π0 h eV0(Y0, Y ′

work page
[13]

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i ≥inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page
[14]

The definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel yields Wrel(µ, µ′)≥inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . The definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel yields Wrel(µ, µ′)≥inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page
[15]

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i

work page
[16]

For anyϵ >0, there existseγϵ 0 ∈ P(S×S)such that Eeγϵ 0 h eV0(Y0, Y ′

work page
[17]

+β KL(1) 0 (µ,eγϵ) + KL(2) 0 (µ′,eγϵ) i (B.14) ≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page
[18]

Using eQeQeQϵ defined in (B.11), we construct a relaxed Markov policy eγϵ :=π eγϵ 0, eQeQeQϵ ∈Π rel

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i +ϵ. Using eQeQeQϵ defined in (B.11), we construct a relaxed Markov policy eγϵ :=π eγϵ 0, eQeQeQϵ ∈Π rel. Notice that (B.12) holds for any initial coupling inP(S×S). Therefore, Part I and (B.12) imply that eV0 ≤ eJ eγϵ 0 < eV0 +ϵ, which further gives that Eeγϵ 0 h eJ eγϵ 0 (Y0, Y ′

work page
[19]

+β KL(1) 0 (µ,eγϵ) + KL(2) 0 (µ′,eγϵ) i ≤E eγϵ 0 h eV0(Y0, Y ′

work page
[20]

+β KL(1) 0 (µ,eγϵ) + KL(2) 0 (µ′,eγϵ) i +ϵ 50 ≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page
[21]

Therefore, by the definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel and the arbitrarily chosenϵ, we obtain Wrel(µ, µ′)≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i + 2ϵ, where the last inequality follows from (B.14). Therefore, by the definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel and the arbitrarily chosenϵ, we obtain Wrel(µ, µ′)≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page
[22]

Combining the two inequalities in both directions, we conclude that Wrel(µ, µ′) = inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . Combining the two inequalities in both directions, we conclude that Wrel(µ, µ′) = inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page
[23]

NX k=0 ck( pYk, pY ′ k) ! ∇θn logq θn n ( pYn, pY ′ n | pYn−1, pY ′ n−1) # =E πθθθ

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i .□ C Omitted proofs in Section 4 In this section, we provide the formal proofs of Theorems 2 and 3, Lemma 4, Corollary 1, and other supporting lemmas used in the proof. C.1 Proof of Theorem 2 Proof.In the following proof, we treatJval(θθθ)in (31) andJ KL(θθθ)in (32) separately. We consider the case with arbitraryn∈[N]...

work page 2021

[1] [1]

Here, the first inequality follows fromJπ n+1 ≥V n+1 and the monotonicity of operatorQπ n+1, and the last inequality follows from (15)

For anyπ∈ΠΠΠbc, we haveQπ n+1 ∈Π n+1 for alln∈[N−1] 0, and J π n =c n +Q π n+1[J π n+1]≥c n +Q π n+1[Vn+1] = Γ Qπ n+1 n [Vn+1]≥Γ n[Vn+1]. Here, the first inequality follows fromJπ n+1 ≥V n+1 and the monotonicity of operatorQπ n+1, and the last inequality follows from (15). Then, by the definition ofVn in (13) as the infimum overπ∈ΠΠΠbc, we obtainVn ≥Γ n[Vn+1]

work page

[2] [2]

NX n=0 cn(Yn, Y ′ n) # =E π

For anyϵ >0, there exist kernelsQ ϵ 1 ∈Π 1, . . . , Qϵ N ∈Π N such that for anyn∈[N−1] 0, Γn[Vn+1]≤c n +Q ϵ n+1[Vn+1]≤Γ n[Vn+1] +ϵ2 −(n+1) (A.1) Then, for anyπ0 ∈Π(µ 0, µ′ 0), we construct a Markov bi-causal coupling πϵ :=π π0,QQQϵ withQQQϵ = (Qϵ 1, . . . , Qϵ N).(A.2) 42 Next, we prove by backward induction that forn∈[N−1] 0, J πϵ n ≤Γ n[Vn+1] +ϵ N−1X k=...

work page

[3] [3]

43 The definition ofWbc in (5) as the infimum overπ∈ΠΠΠbc yieldsW bc(µ, µ′)≥inf π0∈Π(µ0,µ′ 0) Eπ0[V0]

For anyπ∈ΠΠΠbc withπ 0 ∈Π(µ 0, µ′ 0),J π 0 ≥V 0 and thus Eπ J π 0 (Y0, Y ′ 0) ≥E π V0(Y0, Y ′ 0) =E π0 V0(Y0, Y ′ 0) ≥inf π0∈Π(µ0,µ′ 0) Eπ0 V0(Y0, Y ′ 0) . 43 The definition ofWbc in (5) as the infimum overπ∈ΠΠΠbc yieldsW bc(µ, µ′)≥inf π0∈Π(µ0,µ′ 0) Eπ0[V0]

work page

[4] [4]

Notice that (A.3) holds for any initial coupling inΠ(µ0, µ′ 0)

For anyϵ >0, there existsγ ϵ 0 ∈Π(µ 0, µ′ 0)such that Eγϵ 0[V0]≤inf π0∈Π(µ0,µ′ 0) Eπ0[V0] +ϵ.(A.4) UsingQQQε defined in (A.2), we construct a Markov bi-causal couplingγϵ :=π γϵ 0,QQQε ∈ M bc. Notice that (A.3) holds for any initial coupling inΠ(µ0, µ′ 0). Therefore, Part I and (A.3) imply thatV 0 ≤J γϵ 0 < V0 +ϵ, and using (A.4), we have Eγϵ[J γϵ 0 ]≤E γϵ...

work page

[5] [5]

To establish the second equivalent representation, notice thatγ ϵ ∈ M bc(µ, µ′)⊂Π ΠΠbc and Wbc(µ, µ′) = inf π0∈Π(µ0,µ′

Eπ0[V0]. To establish the second equivalent representation, notice thatγ ϵ ∈ M bc(µ, µ′)⊂Π ΠΠbc and Wbc(µ, µ′) = inf π0∈Π(µ0,µ′

work page

[6] [6]

B.1 Proof of Theorem 1 Proof.In the following proof, we work on theP(Y × Y)with the weak convergence topology, and omit the “weak” thereafter

Eπ0[V0], we have Wbc(µ, µ′) = inf π∈ΠΠΠbc Eπ " NX n=0 cn(Yn, Y ′ n) # ≤inf π∈Mbc(µ,µ′) Eπ J π 0 (Y0, Y ′ 0) ≤E γϵ[J πϵ 0 (Y0, Y ′ 0)]≤W bc(µ, µ′) + 2ϵ.(A.5) Sinceϵ >0is arbitrary, all inequalities in (A.5) become equalities, and hence Wbc(µ, µ′) = inf π0∈Π(µ0,µ′ 0) Eπ0[V0(Y0, Y ′ 0)] = inf π∈Mbc(µ,µ′) Eπ " NX n=0 cn(Yn, Y ′ n) # .□ B Omitted proofs in Sec...

work page 2013

[7] [7]

Here, the first inequality follows fromeJ π n+1 ≥ eVn+1 and the monotonicity of the operatorQπ n+1, and the last inequality follows from the definition ofeΓn in (25)

For anyπ∈Π rel, we considerQπ n+1 ∈ Urel for alln∈[N−1] 0, and eJ π n =c n +β h KL(1) n+1(µ, π) + KL(2) n+1(µ′, π) i +Q π n+1[eJ π n+1] ≥c n +β h KL(1) n+1(µ, π) + KL(2) n+1(µ′, π) i +Q π n+1[eVn+1] =eΓ Qπ n+1 n [eVn+1]≥ eΓn[eVn+1]. Here, the first inequality follows fromeJ π n+1 ≥ eVn+1 and the monotonicity of the operatorQπ n+1, and the last inequality ...

work page

[8] [8]

For anyϵ >0, there exist kernels eQϵ 1, . . . ,eQϵ N ∈ Urel such that for anyn∈[N−1] 0, eΓn[eVn+1]≤c n+β h KL(1) n+1(µ,eπϵ)+KL(2) n+1(µ′,eπϵ) i +eQϵ n+1[eVn+1]≤ eΓn[eVn+1]+ϵ2 −(n+1).(B.10) Then, for anyeπ0 ∈ P(S×S), we construct a relaxed Markov policy eπϵ :=π eπ0, eQeQeQϵ with eQeQeQϵ = (eQϵ 1, . . . ,eQϵ N).(B.11) 48 Next, we prove by backward induction...

work page

[9] [9]

Therefore, (17) is equivalent to Wrel(µ, µ′) = inf π∈Πrel Eπ0 h eJ π 0 (Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . Therefore, (17) is equivalent to Wrel(µ, µ′) = inf π∈Πrel Eπ0 h eJ π 0 (Y0, Y ′

work page

[10] [10]

To show the equivalent representation forWrel(µ, µ′), we combine the following two results

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . To show the equivalent representation forWrel(µ, µ′), we combine the following two results

work page

[11] [11]

For anyπ∈Π rel, we haveeJ π 0 ≥ eV0, and thus Eπ0 h eJ π 0 (Y0, Y ′

work page

[12] [12]

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i ≥E π0 h eV0(Y0, Y ′

work page

[13] [13]

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i ≥inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page

[14] [14]

The definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel yields Wrel(µ, µ′)≥inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . The definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel yields Wrel(µ, µ′)≥inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page

[15] [15]

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i

work page

[16] [16]

For anyϵ >0, there existseγϵ 0 ∈ P(S×S)such that Eeγϵ 0 h eV0(Y0, Y ′

work page

[17] [17]

+β KL(1) 0 (µ,eγϵ) + KL(2) 0 (µ′,eγϵ) i (B.14) ≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page

[18] [18]

Using eQeQeQϵ defined in (B.11), we construct a relaxed Markov policy eγϵ :=π eγϵ 0, eQeQeQϵ ∈Π rel

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i +ϵ. Using eQeQeQϵ defined in (B.11), we construct a relaxed Markov policy eγϵ :=π eγϵ 0, eQeQeQϵ ∈Π rel. Notice that (B.12) holds for any initial coupling inP(S×S). Therefore, Part I and (B.12) imply that eV0 ≤ eJ eγϵ 0 < eV0 +ϵ, which further gives that Eeγϵ 0 h eJ eγϵ 0 (Y0, Y ′

work page

[19] [19]

+β KL(1) 0 (µ,eγϵ) + KL(2) 0 (µ′,eγϵ) i ≤E eγϵ 0 h eV0(Y0, Y ′

work page

[20] [20]

+β KL(1) 0 (µ,eγϵ) + KL(2) 0 (µ′,eγϵ) i +ϵ 50 ≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page

[21] [21]

Therefore, by the definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel and the arbitrarily chosenϵ, we obtain Wrel(µ, µ′)≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i + 2ϵ, where the last inequality follows from (B.14). Therefore, by the definition ofWrel(µ, µ′)in (17) as the infimum overπ∈Π rel and the arbitrarily chosenϵ, we obtain Wrel(µ, µ′)≤inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page

[22] [22]

Combining the two inequalities in both directions, we conclude that Wrel(µ, µ′) = inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i . Combining the two inequalities in both directions, we conclude that Wrel(µ, µ′) = inf π0∈P(S×S) Eπ0 h eV0(Y0, Y ′

work page

[23] [23]

NX k=0 ck( pYk, pY ′ k) ! ∇θn logq θn n ( pYn, pY ′ n | pYn−1, pY ′ n−1) # =E πθθθ

+β KL(1) 0 (µ, π) + KL(2) 0 (µ′, π) i .□ C Omitted proofs in Section 4 In this section, we provide the formal proofs of Theorems 2 and 3, Lemma 4, Corollary 1, and other supporting lemmas used in the proof. C.1 Proof of Theorem 2 Proof.In the following proof, we treatJval(θθθ)in (31) andJ KL(θθθ)in (32) separately. We consider the case with arbitraryn∈[N]...

work page 2021