pith. machine review for the scientific record.

arxiv: 2605.08515 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.RO

Recognition: 2 theorem links


Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

James Wilson, Lars Kunze, Michael Groom, Nick Hawes, Victor-Alexandru Darvariu

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:04 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords distributional reinforcement learning · conditional flow matching · Wasserstein distance · quantile coupling · return distributions · offline RL · flow-based critics

The pith

Quantile coupling aligns conditional flow matching with Wasserstein distances in distributional RL

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Distributional RL models the full return distribution rather than its expectation to enable uncertainty-aware decisions. Prior conditional flow matching critics paired source and target samples arbitrarily, creating a mismatch with the Wasserstein contraction property that underpins DRL theory. FlowIQN addresses this by sorting the samples in each mini-batch to approximate the monotone optimal transport coupling. The resulting flow-matching loss then functions as a Wasserstein-aligned approximate projection. This provides a theoretically grounded flow-based critic that fits within standard DRL frameworks.

Core claim

We prove that the loss of our quantile-coupled CFM critic yields a Wasserstein-aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow-matching distributional critic with an explicit Wasserstein-aligned projection guarantee.

What carries the argument

Quantile-coupled conditional flow matching, where source and Bellman target samples are sorted within each mini-batch to approximate the monotone optimal transport coupling and replace arbitrary pairings.
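To make the mechanism concrete, here is a minimal sketch of what a quantile-coupled CFM loss can look like, assuming straight-line (rectified-flow) probability paths and scalar returns; the `critic` network and its signature are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch, assuming linear probability paths and scalar returns.
# `critic` is a hypothetical network v_theta(x_t, t, s, a) that predicts
# the flow velocity; the paper's actual architecture may differ.
import torch

def quantile_coupled_cfm_loss(critic, x0, x1, state, action):
    """x0: source samples, x1: Bellman target return samples, both shape (B,)."""
    # Sort both batches so the i-th smallest source sample is paired with
    # the i-th smallest target sample. In 1D this monotone pairing is the
    # optimal transport coupling between the two empirical measures.
    x0, _ = torch.sort(x0)
    x1, _ = torch.sort(x1)
    t = torch.rand_like(x0)              # flow time, sampled uniformly in [0, 1]
    xt = (1 - t) * x0 + t * x1           # point on the straight path x0 -> x1
    target_velocity = x1 - x0            # constant velocity of that path
    pred = critic(xt, t, state, action)
    return ((pred - target_velocity) ** 2).mean()
```

The only change relative to a standard CFM critic is the pair of `torch.sort` calls: with arbitrary pairings the same loss averages over crossing paths whose expected cost is not a Wasserstein distance.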

If this is right

  • The method improves Wasserstein return-distribution accuracy compared to other CFM critics.
  • It achieves competitive performance on offline RL benchmarks across multiple policy extraction methods.
  • Shortcut models enable efficient inference while retaining the theoretical properties.
  • The approach is readily compatible with existing DRL pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This coupling strategy may extend to other distance metrics or generative models used in reinforcement learning.
  • Improved distributional accuracy could enhance policy performance in tasks sensitive to return variance or risk.
  • The simplicity of batch sorting suggests the method can be adopted with minimal changes to current training code.

Load-bearing premise

That sorting source and Bellman target samples within each mini-batch sufficiently approximates the monotone optimal transport coupling.
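The premise leans on a standard fact of one-dimensional optimal transport (see, e.g., Santambrogio [26]): for two empirical measures with $n$ atoms each, the $p$-Wasserstein distance is attained exactly by the monotone pairing of order statistics,

$$W_p^p\left(\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i},\ \frac{1}{n}\sum_{i=1}^{n}\delta_{y_i}\right)=\frac{1}{n}\sum_{i=1}^{n}\bigl|x_{(i)}-y_{(i)}\bigr|^{p},$$

where $x_{(i)}$ and $y_{(i)}$ are the sorted samples. What the premise adds, and what makes it load-bearing, is the claim that this batch-level identity tracks the population Wasserstein distance well enough at practical batch sizes.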

What would settle it

Experiments showing that the Wasserstein distance between the critic's predicted return distribution and the Bellman target fails to improve over uncoupled CFM variants on standard benchmarks would refute the alignment claim.
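Because returns are scalar, the test statistic is cheap to compute: the 1D empirical 2-Wasserstein distance reduces to a root-mean-square gap between sorted samples. A minimal sketch (sample arrays hypothetical):

```python
import numpy as np

def empirical_w2(samples_a, samples_b):
    """Empirical 2-Wasserstein distance between two equal-size 1D samples.

    In 1D the optimal coupling pairs order statistics, so W2 reduces to the
    root-mean-square difference between sorted samples.
    """
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    assert a.shape == b.shape, "use equal-size samples for the sorted pairing"
    return float(np.sqrt(np.mean((a - b) ** 2)))

# e.g., compare critic return samples against Monte Carlo return targets:
# w2 = empirical_w2(critic_samples, mc_return_samples)
```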

Figures

Figures reproduced from arXiv: 2605.08515 by James Wilson, Lars Kunze, Michael Groom, Nick Hawes, Victor-Alexandru Darvariu.

Figure 1: FlowIQN aligns flow matching with Wasserstein geometry. (Left) Standard CFM arbitrarily couples source and target quantiles (τ, τ′). This metric mismatch produces crossing paths and a double-integral expected cost that is not a Wasserstein projection. (Center & Right) FlowIQN enforces a sorted coupling (τ ≈ τ′) to approximate the 1D monotone optimal transport map. Because sorting pairs samples by their…

Figure 2: Quantile coupling improves flow-based return modelling. Performance profiles of negative empirical W2 distance to Monte Carlo (MC) return targets in the fixed-policy evaluation setting; higher curves indicate lower Wasserstein error. FlowIQN closes much of the gap between flow-based return models and classical distributional critics, substantially improving over Value Flows while remaining competitive with…

Figure 3: OGBench [24] tasks, including manipulation and locomotion tasks. Offline RL tasks and datasets. Following recent works in offline RL [2, 9, 25], we use the reward-based single-task variants of OGBench tasks as our primary benchmark [24]. OGBench provides a range of long-horizon robotics tasks, with sparse or semi-sparse rewards, and highly multimodal action distributions, making it a strong testbed for c…

Figure 4: Ablations on FlowIQN integration. Performance profiles of negative empirical W2 distance in the fixed-policy evaluation setting; higher curves indicate lower Wasserstein error. (a) The adaptive schedule improves accuracy at matched Euler step counts, while increasing the number of uniform steps alone does not recover the same performance. (b) Increasing the number of Euler steps yields only marginal gains…
Original abstract

Unlike standard expected-return Reinforcement Learning (RL), Distributional RL (DRL) models the full return distribution, making it better-suited for uncertainty-aware and risk-sensitive decision-making. Conditional Flow Matching (CFM) critics have recently attracted attention for modelling continuous, multi-modal return distributions. Despite this interest, there remains a substantial metric mismatch: DRL theory relies on the distributional Bellman operator being contractive in the $p$-Wasserstein distance, yet existing CFM critics are trained with arbitrary source-target couplings, so their flow-matching losses are not Wasserstein-aligned surrogates for matching Bellman target return distributions. In this work, we address this mismatch by proposing FlowIQN, a CFM critic that sorts source and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling, replacing arbitrary pairings with quantile-aligned flow paths. We prove that the loss of our quantile-coupled CFM critic yields a Wasserstein-aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow-matching distributional critic with an explicit Wasserstein-aligned projection guarantee. We further extend FlowIQN with shortcut models for efficient inference. Empirical results show that FlowIQN improves Wasserstein return-distribution accuracy over other CFM critics. It also yields competitive performance on offline RL benchmarks across multiple policy extraction methods, providing a theoretically grounded CFM critic that is readily compatible with DRL pipelines. Code: https://github.com/ori-goals/flowIQN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FlowIQN, a conditional flow matching (CFM) critic for distributional reinforcement learning that sorts source and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling. It claims to prove that the resulting quantile-coupled CFM loss yields a Wasserstein-aligned approximate projection compatible with DRL foundations (i.e., contractivity of the distributional Bellman operator in the p-Wasserstein metric), and reports empirical gains in Wasserstein return-distribution accuracy plus competitive performance on offline RL benchmarks across policy extraction methods. Code is provided.

Significance. If the approximation error from mini-batch sorting can be controlled and the projection guarantee shown to preserve non-expansiveness, the result would be significant: it would supply the first flow-matching distributional critic with an explicit Wasserstein-aligned projection property, closing the metric mismatch between CFM training objectives and DRL theory. The reproducible code and benchmark results strengthen the practical contribution.

major comments (2)
  1. [Abstract and proof of projection guarantee] Abstract and the section presenting the projection guarantee: the claim that quantile-coupled CFM yields a Wasserstein-aligned approximate projection compatible with DRL foundations rests on mini-batch sorting serving as a sufficient proxy for the monotone OT map. While sorting yields the exact OT map between the two empirical measures in 1D, the manuscript provides no quantitative bound on the deviation of the resulting objective from the true W_p distance (as a function of batch size, support cardinality, or tail behavior), nor does it verify that the approximate projection operator remains non-expansive. This is load-bearing for the compatibility with fixed-point arguments.
  2. [Empirical results] Experimental section reporting Wasserstein accuracy: the claimed improvements over other CFM critics are presented without ablations that isolate the effect of the quantile-sorting coupling (e.g., comparison to random or learned couplings, or sensitivity to batch size). Without such controls it is unclear whether the gains are attributable to the Wasserstein alignment or to other implementation choices.
minor comments (2)
  1. [Method] Notation for the flow paths and coupling should be made fully explicit (e.g., how the sorted pairs define the time-dependent vector field) to facilitate reproduction and extension.
  2. [Extensions] The shortcut-model extension for inference efficiency is mentioned but its interaction with the quantile coupling is not detailed; a brief complexity or error analysis would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, clarifying the theoretical claims and committing to additional empirical controls.

Point-by-point responses
  1. Referee: [Abstract and proof of projection guarantee] Abstract and the section presenting the projection guarantee: the claim that quantile-coupled CFM yields a Wasserstein-aligned approximate projection compatible with DRL foundations rests on mini-batch sorting serving as a sufficient proxy for the monotone OT map. While sorting yields the exact OT map between the two empirical measures in 1D, the manuscript provides no quantitative bound on the deviation of the resulting objective from the true W_p distance (as a function of batch size, support cardinality, or tail behavior), nor does it verify that the approximate projection operator remains non-expansive. This is load-bearing for the compatibility with fixed-point arguments.

    Authors: We agree that an explicit quantitative bound would make the approximation guarantee more complete. The manuscript proves that mini-batch sorting realizes the exact monotone optimal transport map between the empirical source and target measures, so the resulting CFM objective is exactly the Wasserstein distance between those finite-support measures. We do not derive a new finite-sample concentration bound on the deviation from the population W_p distance (which would indeed depend on batch size, support size, and tail behavior), but we will add a discussion paragraph in the revised version that invokes existing 1D empirical Wasserstein concentration results to quantify the approximation rate (one representative bound is sketched after these responses). On non-expansiveness, the proof shows that the quantile-coupled operator is a consistent approximation to the true (non-expansive) Wasserstein projection; we will insert a short remark clarifying that the fixed-point property is preserved in the large-batch limit and that the operator remains contractive in expectation under standard assumptions on the return distributions. revision: partial

  2. Referee: [Empirical results] Experimental section reporting Wasserstein accuracy: the claimed improvements over other CFM critics are presented without ablations that isolate the effect of the quantile-sorting coupling (e.g., comparison to random or learned couplings, or sensitivity to batch size). Without such controls it is unclear whether the gains are attributable to the Wasserstein alignment or to other implementation choices.

    Authors: We concur that isolating the contribution of the quantile-sorting coupling is necessary. In the revised manuscript we will add two sets of controls: (i) direct comparisons of FlowIQN against identical architectures trained with random source-target pairings and with a learned coupling network, and (ii) a batch-size sensitivity sweep (e.g., 32, 64, 128, 256) reporting both Wasserstein accuracy and downstream policy performance. These ablations will be placed in a new subsection of the experimental results. revision: yes
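An editorial pointer on the concentration discussion promised in response 1, not a result from the paper: one representative 1D bound the authors could invoke is due to Bobkov and Ledoux, which, assuming the integral below is finite, gives

$$\mathbb{E}\,W_1(\hat{\mu}_n,\mu)\ \le\ \frac{1}{\sqrt{n}}\int_{\mathbb{R}}\sqrt{F(x)\bigl(1-F(x)\bigr)}\,dx,$$

where $F$ is the CDF of $\mu$ and $\hat{\mu}_n$ the empirical measure on $n$ samples. A bound of this form would tie the mini-batch size directly to the deviation the referee asks about, at least for $p = 1$.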

Circularity Check

0 steps flagged

No significant circularity; proof of approximate projection is independent of inputs

full rationale

The paper claims a proof that the quantile-coupled CFM loss produces a Wasserstein-aligned approximate projection compatible with DRL foundations (contractivity in W_p). This is presented as a separate mathematical argument relying on the exactness of sorting for the monotone coupling between empirical measures in 1D, with the approximation error from mini-batch sorting explicitly noted as such. No equation reduces to another by construction, no parameter is fitted and then relabeled as a prediction, and no load-bearing self-citation or uniqueness theorem imported from prior author work is used to force the result. The empirical evaluation is kept separate from the theoretical guarantee. The derivation chain therefore remains self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard DRL theory and introduces quantile coupling as the key innovation; no free parameters or new entities are indicated in the provided abstract.

axioms (1)
  • domain assumption The distributional Bellman operator is contractive in the p-Wasserstein distance.
    This is invoked as the foundation of DRL theory that the new critic must remain compatible with.
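For reference, the axiom as usually stated (Bellemare et al. [4]): for a fixed policy $\pi$ and discount $\gamma \in [0,1)$, the distributional Bellman operator $\mathcal{T}^\pi$ satisfies

$$\bar{W}_p(\mathcal{T}^\pi\eta,\ \mathcal{T}^\pi\eta')\ \le\ \gamma\,\bar{W}_p(\eta,\eta'),\qquad \bar{W}_p(\eta,\eta') := \sup_{s,a} W_p\bigl(\eta(s,a),\eta'(s,a)\bigr).$$

This is also why the referee's non-expansiveness question is load-bearing: if the critic's approximate projection $\Pi$ is non-expansive in the same metric, then the composed operator $\Pi\mathcal{T}^\pi$ remains a $\gamma$-contraction and inherits a unique fixed point; without that property, the compatibility claim has a gap.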

pith-pipeline@v0.9.0 · 5580 in / 1153 out tokens · 21282 ms · 2026-05-12T02:04:14.877687+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 4 internal anchors

  [1] B. Agrawalla, M. Nauman, and A. Kumar. What does flow matching bring to TD learning? arXiv preprint arXiv:2603.04333, 2026.

  [2] B. K. Agrawalla, M. Nauman, K. Agrawal, and A. Kumar. floq: Training critics via flow-matching for scaling compute in value-based RL. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=m14YNdmPAh.

  [3] M. Albergo, N. M. Boffi, and E. Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 26(209):1–80, 2025.

  [4] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.

  [5] M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. MIT Press, 2023. URL http://www.distributional-rl.org.

  [6] D. Chen, Y. Liu, Z. Zhou, C. Qu, and Y. Qi. Unleashing flow policies with distributional critics. arXiv preprint arXiv:2509.23087, 2025.

  [7] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1096–1105. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/dab...

  [8] W. Dabney, M. Rowland, M. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

  [9] P. Dong, C. Zheng, C. Finn, D. Sadigh, and B. Eysenbach. Value flows. In International Conference on Learning Representations (ICLR), 2026.

  [10] N. Espinosa-Dice, K. Brantley, and W. Sun. Expressive value learning for scalable offline reinforcement learning. arXiv preprint arXiv:2510.08218, 2025.

  [11] K. Frans, D. Hafner, S. Levine, and P. Abbeel. One step diffusion via shortcut models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=OlzB6LnXcS.

  [12] S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019.

  [13] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html.

  [14] S. Jo and S. Choi. Formalizing the sampling design space of diffusion-based generative models via adaptive solvers and Wasserstein-bounded timesteps. arXiv preprint arXiv:2602.12624, 2026.

  [15] A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.

  [16] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

  [17] Y. Lipman, R. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.

  [18] Y. Lipman, M. Havasi, P. Holderrieth, N. Shaul, M. Le, B. Karrer, R. T. Chen, D. Lopez-Paz, H. Ben-Hamu, and I. Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024.

  [19] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2209.03003.

  [20] X. Ma, J. Chen, L. Xia, J. Yang, Q. Zhao, and Z. Zhou. DSAC: Distributional soft actor-critic for risk-sensitive reinforcement learning. Journal of Artificial Intelligence Research, 83, 2025.

  [21] Y. Ma, D. Jayaraman, and O. Bastani. Conservative offline distributional reinforcement learning. Advances in Neural Information Processing Systems, 34:19235–19247, 2021.

  [22] B. Mavrin, H. Yao, L. Kong, K. Wu, and Y. Yu. Distributional reinforcement learning for efficient exploration. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4424–4434. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v...

  [23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

  [24] S. Park, K. Frans, B. Eysenbach, and S. Levine. OGBench: Benchmarking offline goal-conditioned RL. In International Conference on Learning Representations (ICLR), 2025.

  [25] S. Park, Q. Li, and S. Levine. Flow Q-learning. In International Conference on Machine Learning (ICML), 2025.

  [26] F. Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015. ISBN 9783319208282. URL https://books.google.co.uk/books?id=UOHHCgAAQBAJ.

  [27] M. G. Silveri, A. O. Durmus, and G. Conforti. Theoretical guarantees in KL for diffusion flow matching. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=ia4WUCwHA9.

  [29] A. Tong, K. Fatras, N. Malkin, G. Huguet, Y. Zhang, J. Rector-Brooks, G. Wolf, and Y. Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, pages 1–34, Mar. 2024. ISSN 2835-8856.

  [30] K. Wang, K. Zhou, R. Wu, N. Kallus, and W. Sun. The benefits of being distributional: Small-loss bounds for reinforcement learning. Advances in Neural Information Processing Systems, 36:2275–2312, 2023.

  [31] K. Wang, O. Oertell, A. Agarwal, N. Kallus, and W. Sun. More benefits of being distributional: Second-order bounds for reinforcement learning. arXiv preprint arXiv:2402.07198, 2024.

  [32] K. Wang, I. Javali, M. Bortkiewicz, T. Trzcinski, and B. Eysenbach. 1000 layer networks for self-supervised RL: Scaling depth can enable new goal-reaching capabilities. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=s0JVsx3bx1.

  [33] P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs, et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.

  [34] S. Zhong, S. Ding, H. Diao, X. Wang, K. C. Teh, and B. Peng. FlowCritic: Bridging value estimation with flow matching in reinforcement learning. arXiv preprint arXiv:2510.22686, 2025.