Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cramér Surrogate
Pith reviewed 2026-05-22 16:30 UTC · model grok-4.3
The pith
Normalizing flows with a geometry-aware Cramér distance enable parameter-efficient distributional reinforcement learning
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing return distributions with continuous normalizing flows and training them via a geometry-aware Cramér surrogate on probability masses, the method claims four properties at once: the surrogate is a true probability metric, the distributional Bellman operator is a sqrt(gamma)-contraction under it, the objective admits unbiased sample gradients, and the parameter footprint does not grow with distribution resolution. Together these are said to allow recovery of rich multi-modal returns on toy MDPs and performance competitive with categorical baselines on Atari-5.
What carries the argument
The geometry-aware Cramér distance defined over probability masses from the normalizing flow, which enables training of the continuous representation while guaranteeing metric properties and contraction behavior.
If this is right
- Return distributions can be modeled with adaptive support without discretizing into fixed bins or quantiles.
- The parameter count stays constant even as the effective resolution or complexity of the return distribution increases.
- Unbiased gradients from the objective allow for stable end-to-end training of the flow model.
- Performance is competitive with categorical methods on Atari-5 while using substantially fewer parameters.
Where Pith is reading between the lines
- This approach may allow scaling distributional RL to settings where return distributions are continuous or highly complex without prohibitive parameter growth.
- Future work could explore integrating the flow-based model with other RL components like actor-critic methods for end-to-end learning.
- The contraction property suggests potential for theoretical analysis of convergence rates in flow-based distributional RL.
Load-bearing premise
Continuous normalizing flows can be trained to accurately capture the possibly multi-modal or heavy-tailed return distributions in complex MDPs such as Atari games without encountering instability or mode collapse.
What would settle it
A failure to recover multi-modal return distributions on toy MDPs or to achieve competitive scores on Atari-5 with the claimed parameter savings would indicate that the flow-based model does not deliver the promised advantages over discrete alternatives.
Original abstract
Distributional Reinforcement Learning (DistRL) improves upon expectation-based methods by modeling full return distributions, but standard approaches often remain far from parsimonious. Categorical methods (e.g., C51) rely on fixed supports where parameter counts scale linearly with resolution, while quantile methods approximate distributions as discrete mixtures whose piecewise-constant densities can be wasteful when modeling complex multi-modal or heavy-tailed returns. We introduce NFDRL, a parsimonious architecture that models return distributions using continuous normalizing flows. Unlike categorical baselines, our flow-based model maintains a compact parameter footprint that does not grow with the effective resolution of the distribution, while providing a dynamic, adaptive support for returns. To train this continuous representation, we propose a Cramér-inspired, geometry-aware distance defined over probability masses obtained from the flow. We show that this distance is a true probability metric, that the associated distributional Bellman operator is a sqrt(gamma)-contraction, and that the resulting objective admits unbiased sample gradients, properties that are typically not simultaneously guaranteed in prior PDF-based DistRL methods. Empirically, NFDRL recovers rich, multi-modal return landscapes on toy MDPs and achieves performance competitive with categorical baselines on the Atari-5 benchmark, while offering substantially better parameter efficiency.
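The abstract's point about parameter counts scaling with resolution can be made concrete with a little arithmetic. The sketch below counts only the final linear output layer of a value network. The categorical head follows the usual C51 layout of one logit per (action, atom) pair. The flow head is a hypothetical stand-in that emits a fixed number of flow parameters per action; the paper's actual architecture is not described on this page, and the sizes (512 hidden units, 6 actions, 8 flow parameters) are illustrative assumptions.

```python
def linear_head_params(hidden_dim: int, n_outputs: int) -> int:
    """Weights plus biases of one fully connected output layer."""
    return hidden_dim * n_outputs + n_outputs


def categorical_head_params(hidden_dim: int, n_actions: int, n_atoms: int) -> int:
    """C51-style head: one logit per (action, atom) pair."""
    return linear_head_params(hidden_dim, n_actions * n_atoms)


def flow_head_params(hidden_dim: int, n_actions: int, n_flow_params: int) -> int:
    """HYPOTHETICAL flow head: a fixed number of flow parameters per action.

    This is an assumption for illustration, not the NFDRL architecture.
    """
    return linear_head_params(hidden_dim, n_actions * n_flow_params)


# Illustrative sizes only (assumptions, not taken from the paper).
HIDDEN, ACTIONS, FLOW_PARAMS = 512, 6, 8

for n_atoms in (51, 101, 201):
    cat = categorical_head_params(HIDDEN, ACTIONS, n_atoms)
    flow = flow_head_params(HIDDEN, ACTIONS, FLOW_PARAMS)
    print(f"atoms={n_atoms:4d}  categorical head={cat:7d}  flow head={flow:6d}")
```

Under these assumptions the categorical head grows in proportion to the number of atoms, while the flow head stays the same size whatever resolution is used to evaluate the distribution.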
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces NFDRL, a distributional RL algorithm that represents return distributions via continuous normalizing flows rather than fixed-support categorical or quantile approximations. It defines a geometry-aware Cramér surrogate distance over probability masses extracted from the flow, asserts that this distance is a true metric, proves that the induced distributional Bellman operator is a sqrt(gamma)-contraction, and shows that the resulting training objective admits unbiased sample gradients. Toy-MDP experiments demonstrate recovery of multi-modal return landscapes; Atari-5 results are reported as competitive with categorical baselines while using substantially fewer parameters whose count does not scale with distributional resolution.
Significance. If the metric and contraction properties are rigorously established and the continuous flows can be trained stably on the multi-modal or heavy-tailed returns that arise in Atari-scale MDPs, the work would offer a principled route to parameter-efficient distributional RL. The simultaneous guarantees of metricity, contraction, and unbiased gradients address limitations that have persisted in prior density-based DistRL methods; the resolution-independent parameter footprint is a practical advantage if the empirical claims hold.
major comments (3)
- [§4.2, Eq. (12)] The proof that the geometry-aware Cramér distance is a true probability metric relies on the flow producing well-defined probability masses; the manuscript must explicitly state how these masses are obtained from the continuous density (e.g., via quadrature or discretization) and verify that the resulting distance satisfies the triangle inequality without additional assumptions that may not hold for arbitrary flow architectures.
- [§4.3, Theorem 1] The claimed sqrt(gamma)-contraction of the distributional Bellman operator is load-bearing for the convergence argument, yet the derivation appears to treat the flow parameters as fixed during the operator application; the manuscript should clarify whether the contraction still holds when the flow is updated concurrently with the policy, as is the case in the practical algorithm.
- [§5.3, Table 3] The Atari-5 results report competitive scores with an order-of-magnitude reduction in parameters, but no ablation isolates the contribution of the geometry-aware surrogate versus standard flow training; without this, it remains unclear whether the performance gain is attributable to the proposed distance or to other implementation choices.
minor comments (3)
- The abstract and §2 contain several LaTeX artifacts (e.g., “Cramér” rendered with a backslash escape); these should be cleaned up for readability.
- Figure 2 (toy MDP return landscapes) would benefit from an additional panel showing the learned flow density overlaid on the empirical histogram to allow visual assessment of mode recovery.
- The related-work discussion in §1.2 omits recent work on continuous normalizing flows for RL value functions; adding a brief comparison would strengthen context.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes incorporated into the revised manuscript.
Point-by-point responses
- Referee: [§4.2, Eq. (12)] The proof that the geometry-aware Cramér distance is a true probability metric relies on the flow producing well-defined probability masses; the manuscript must explicitly state how these masses are obtained from the continuous density (e.g., via quadrature or discretization) and verify that the resulting distance satisfies the triangle inequality without additional assumptions that may not hold for arbitrary flow architectures.
Authors: We agree that the presentation of how probability masses are extracted requires explicit clarification. In the revised manuscript we have expanded §4.2 to state that masses are obtained by numerical quadrature of the flow density over a uniform discretization of the return support (with bin width chosen to match the effective resolution used in the Atari experiments). We further include a short appendix lemma showing that the geometry-aware surrogate inherits the triangle inequality from the classical Cramér distance on the resulting discrete measures; the argument relies only on non-negativity and integrability of the density, which hold for the continuous normalizing flows employed in the paper.
Revision: yes
- Referee: [§4.3, Theorem 1] The claimed sqrt(gamma)-contraction of the distributional Bellman operator is load-bearing for the convergence argument, yet the derivation appears to treat the flow parameters as fixed during the operator application; the manuscript should clarify whether the contraction still holds when the flow is updated concurrently with the policy, as is the case in the practical algorithm.
Authors: The sqrt(gamma)-contraction is established for the distributional Bellman operator T acting on the space of probability measures equipped with the geometry-aware distance; the proof does not depend on the parameters being frozen. The flow parameters are optimized separately via stochastic gradient descent on the surrogate loss. We have added a clarifying paragraph immediately after Theorem 1 that distinguishes the contraction of the operator (which guarantees convergence of iterated application) from the practical parameter-update dynamics, noting that this separation is standard in analyses of parameterized distributional RL methods.
Revision: yes
- Referee: [§5.3, Table 3] The Atari-5 results report competitive scores with an order-of-magnitude reduction in parameters, but no ablation isolates the contribution of the geometry-aware surrogate versus standard flow training; without this, it remains unclear whether the performance gain is attributable to the proposed distance or to other implementation choices.
Authors: We acknowledge that an explicit ablation comparing the geometry-aware surrogate against a generic flow training objective would strengthen the empirical section. However, defining a comparable “standard” flow loss that preserves unbiased gradients and metric properties is non-trivial and would require substantial additional implementation and compute. We have therefore expanded the discussion in §5.3 to link the observed multi-modal recovery on toy MDPs directly to the geometry-aware distance and to argue that the parameter-efficiency advantage is inseparable from the proposed surrogate. We leave a fuller ablation for future work.
Revision: partial
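The first response above describes extracting probability masses by numerical quadrature over a uniform discretization and then relying on the triangle inequality of the classical Cramér distance on the resulting discrete measures. A minimal sketch of that pipeline follows. It assumes Gaussian and Gaussian-mixture densities as stand-ins for a trained flow, a hand-written trapezoid rule, and the classical Cramér (l2) distance rather than the paper's geometry-aware variant, which is not reproduced on this page. The helper names are hypothetical.

```python
import numpy as np


def bin_masses(pdf, edges, n_sub=64):
    """Trapezoid-rule mass of `pdf` in each bin [edges[i], edges[i+1]], renormalized to sum to 1."""
    masses = np.empty(len(edges) - 1)
    for i in range(len(edges) - 1):
        x = np.linspace(edges[i], edges[i + 1], n_sub)
        y = pdf(x)
        masses[i] = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))
    return masses / masses.sum()


def cramer(p, q, width):
    """Classical Cramér (l2) distance between mass vectors on a shared uniform grid."""
    cdf_gap = np.cumsum(p) - np.cumsum(q)
    return float(np.sqrt(width * np.sum(cdf_gap ** 2)))


def gauss(mu, sigma):
    """Gaussian density, used here only as a stand-in for a flow density."""
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


edges = np.linspace(-10.0, 10.0, 52)  # 51 uniform bins (illustrative choice)
width = edges[1] - edges[0]

p = bin_masses(gauss(-3.0, 1.0), edges)
q = bin_masses(gauss(0.0, 2.0), edges)
r = bin_masses(lambda x: 0.5 * gauss(-4.0, 1.0)(x) + 0.5 * gauss(4.0, 1.0)(x), edges)

d_pq, d_qr, d_pr = cramer(p, q, width), cramer(q, r, width), cramer(p, r, width)
print(f"d(p,r)={d_pr:.4f}  d(p,q)+d(q,r)={d_pq + d_qr:.4f}")
```

The check at the end only confirms the triangle inequality for one triple under the classical distance, where it holds because the distance is an L2 norm of CDF differences. Whether the same argument carries over to the geometry-aware surrogate is exactly what the referee asks the authors to verify.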
Circularity Check
No significant circularity; theoretical claims derived independently from definitions
Full rationale
The paper defines a geometry-aware Cramér distance over flow probability masses and derives its metric properties, the sqrt(gamma)-contraction of the distributional Bellman operator, and unbiased sample gradients as mathematical results from the flow representation and distance definition. These are presented as proven properties rather than fitted quantities or reductions to self-citations. Empirical results on toy MDPs and Atari-5 are framed as competitive performance validation, separate from the derivation. No step in the abstract or described claims reduces by construction to its inputs, self-citations, or renamed known results. The derivation chain remains self-contained against the stated assumptions.
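To make the sqrt(gamma) factor discussed above tangible, the sketch below checks the scaling behaviour numerically for the classical Cramér distance, not for the paper's geometry-aware variant. It assumes two random categorical distributions on a shared grid and a single deterministic transition z -> r + gamma * z; the full Bellman operator also averages over next states and rewards, which this sketch leaves out. All names and constants are illustrative.

```python
import numpy as np


def cramer_distance(support, p, q):
    """Classical Cramér (l2) distance between two discrete distributions on a sorted support.

    The CDFs are step functions, so the integral of the squared CDF gap is a
    sum over the intervals between consecutive atoms.
    """
    gaps = np.diff(support)
    cdf_gap = (np.cumsum(p) - np.cumsum(q))[:-1]
    return float(np.sqrt(np.sum(cdf_gap ** 2 * gaps)))


rng = np.random.default_rng(0)
support = np.linspace(-10.0, 10.0, 51)
p = rng.dirichlet(np.ones(51))
q = rng.dirichlet(np.ones(51))

gamma, reward = 0.99, 1.0  # illustrative values
pushed_support = reward + gamma * support  # atoms after z -> r + gamma * z

before = cramer_distance(support, p, q)
after = cramer_distance(pushed_support, p, q)
ratio = after / before
print(f"ratio={ratio:.6f}  sqrt(gamma)={np.sqrt(gamma):.6f}")
```

The shift by the reward leaves the distance unchanged and the scaling shrinks every gap between atoms by gamma, so the squared distance shrinks by gamma and the distance by sqrt(gamma).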
Reference graph
Works this paper leans on
- [1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [2] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 449–458. PMLR, 08 2017. URL https://proceedings.mlr.press/v70/bellemare17a.html
- [3] Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributional reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1096–1105. PMLR, 07 2018. URL https://proceedings.mlr.press/ v8...
- [4] Ben Eysenbach, Russ R. Salakhutdinov, and Sergey Levine. Search on the replay buffer: Bridging planning and reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/ paper_fi...
- [5] Arthur Charpentier, Romuald Elie, and Carl Remlinger. Reinforcement learning in economics and finance, 2020. URL https://arxiv.org/abs/2003.10014
- [6] Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In AAAI, 2017
- [7] Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://...
- [8] Fan Zhou, Jianing Wang, and Xingdong Feng. Non-crossing quantile regression for distributional reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15909–15919. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/ paper...
- [9] Thibaut Théate, Antoine Wehenkel, Adrien Bolland, Gilles Louppe, and Damien Ernst. Distributional reinforcement learning with unconstrained monotonic neural networks. Neurocomputing, 534:199–219, May 2023. ISSN 0925-2312. doi: 10.1016/j.neucom.2023.02.049. URL http://dx.doi.org/10.1016/j.neucom.2023.02.049
- [10] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021. URL http://jmlr.org/papers/v22/19-1028.html
- [11] Matthew Aitchison, Penny Sweetser, and Marcus Hutter. Atari-5: Distilling the arcade learning environment down to five games, 2022. URL https://arxiv.org/abs/2210.02019
- [12] Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org
- [13] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2722–...
- [14] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients, 2017. URL https://arxiv.org/abs/1705.10743
- [15] Sami Jullien, Romain Deffayet, Jean-Michel Renders, Paul Groth, and Maarten de Rijke. Distributional reinforcement learning with dual expectile-quantile regression, 2024. URL https://arxiv.org/abs/2305.16877
- [16] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.
- [17] Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018
- [18] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G.M. Araújo. Cleanrl: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022. URL http://jmlr.org/papers/v23/21-1342.html.
A. Limitations
Variance: Our method exhibits high...
- [19] The training variance might not help the model converge faster.
- [20] Instead of directly learning the density at given or specified values, as C51 does, our model learns flow parametrisations that indirectly lead to return distributions. This indirect relationship might hinder learning performance by making the task more complex for the model.
- [21] Normalizing flows are effective for learning exact likelihoods, but they are notoriously slow to train; our empirical results confirm this.
CDF Flow: While using a CDF as a flow transformation offers advantages in modeling monotonic mappings and enabling efficient computation of the Cramér distance (main paper, Section 3.2), it also introduces notable limita...
- [22] draws the following propositions:
Proposition 1: The KL divergence has unbiased sample gradients (U), but is not scale sensitive (S).
Proposition 2: The Wasserstein metric is ideal (I, S), but does not have unbiased sample gradients.
E. Cramér-inspired geometry-aware metric
We introduce a Cramér-inspired, geometry-aware metric on discrete probability masses...
- [23] which bins carry mass (via $\Omega_a$, $\Omega_b$),
- [24] not by how far apart those bins are. Concretely, on a grid $\{-10, -9, \dots, 10\}$, one can compute $\Omega_{-10} = \Omega_{+10} = 210$, $\Omega_{-9} = 191$, so that $D^2(\delta_{-10}, \delta_{-9}) \propto 210 + 191 = 401$ and $D^2(\delta_{-10}, \delta_{+10}) \propto 210 + 210 = 420$, which are very close despite the spikes being at distance 1 vs 20. In other words: for disjoint one-hot distributions, the exact metric $D$ behaves as a geometry-weigh...
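The grid arithmetic in the excerpt above can be reproduced in a few lines. The definition of $\Omega$ is not included in this excerpt, so the sketch assumes $\Omega_a = \sum_k |x_a - x_k|$, the summed absolute distance from bin $a$ to every grid point. That assumption is chosen because it reproduces the quoted values 210 and 191; it should not be read as the paper's definition.

```python
import numpy as np

grid = np.arange(-10, 11)  # the grid {-10, -9, ..., 10} from the excerpt

# ASSUMPTION: Omega_a is the summed absolute distance from bin a to all grid points.
# This matches the quoted numbers but is inferred, not taken from the paper.
omega = np.abs(grid[:, None] - grid[None, :]).sum(axis=1)
omega_at = dict(zip(grid.tolist(), omega.tolist()))

adjacent = omega_at[-10] + omega_at[-9]   # spikes one bin apart
opposite = omega_at[-10] + omega_at[10]   # spikes twenty bins apart

print(omega_at[-10], omega_at[10], omega_at[-9])  # 210 210 191
print(adjacent, opposite)                          # 401 420
```

Under this reading, the two sums differ by under 5% even though the spike separation differs by a factor of 20, which is the insensitivity the excerpt is pointing at.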
- [25] Sample from the continuous densities: $y^{(1)}, \dots, y^{(N)} \sim p$ and $\tilde{y}^{(1)}, \dots, \tilde{y}^{(M)} \sim q$.
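- [26] Estimate $p$ and $q$ via KDE on each support using a kernel $K_h$ with bandwidth $h > 0$. On the predicted support $\{y_i\}_{i=1}^{N}$: $\hat{p}(y_i) = \frac{1}{N} \sum_{k=1}^{N} K_h(y_i - y^{(k)})$, $\hat{q}(y_i) = \frac{1}{M} \sum_{j=1}^{M} K_h(y_i - \tilde{y}^{(j)})$. On the target support $\{\tilde{y}_j\}_{j=1}^{M}$: $\hat{p}(\tilde{y}_j) = \frac{1}{N} \sum_{k=1}^{N} K_h(\tilde{y}_j - y^{(k)})$, $\hat{q}(\tilde{y}_j) = \frac{1}{M} \sum_{j'=1}^{M} K_h(\tilde{y}_j - \tilde{y}^{(j')})$.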
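- [27] Discretize these KDEs into mass vectors on each grid (e.g. by Riemann approximation): $w^{(y)}_i \approx \frac{\hat{p}(y_i)\,\Delta y}{\sum_k \hat{p}(y_k)\,\Delta y}$, $v^{(y)}_i \approx \frac{\hat{q}(y_i)\,\Delta y}{\sum_k \hat{q}(y_k)\,\Delta y}$, and similarly $w^{(\tilde{y})}, v^{(\tilde{y})}$ on $\{\tilde{y}_j\}$. The practical loss we use (Eq. (11) in the main text) is then $\mathcal{L}(\eta^{\pi}(x, a), \mathcal{T}^{\pi}\eta(x, a)) = D(w^{(y)}, v^{(y)}) + D(w^{(\tilde{y})}, v^{(\tilde{y})})$ (28), where $D$ is exactly the metric defined in (25), app...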
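The three steps above (sample, estimate by KDE on each support, discretize into mass vectors, then sum a distance over both supports) can be sketched as follows. This is a rough illustration under several assumptions: a Gaussian kernel, a constant $\Delta y$ that cancels in the normalization, Gaussian samples in place of flow samples, and the classical Cramér distance as a stand-in for the metric $D$ of Eq. (25), which is not reproduced in this excerpt. Function names are hypothetical.

```python
import numpy as np


def kde(points, samples, h):
    """Gaussian KDE built from `samples`, evaluated at `points`, with bandwidth h."""
    z = (points[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))


def masses(support, samples, h):
    """Mass vector on `support`; a constant Delta y cancels, leaving a plain normalization."""
    dens = kde(support, samples, h)
    return dens / dens.sum()


def cramer_stand_in(support, w, v):
    """STAND-IN for the paper's metric D: classical Cramér distance on a sorted support."""
    gaps = np.diff(support)
    cdf_gap = (np.cumsum(w) - np.cumsum(v))[:-1]
    return float(np.sqrt(np.sum(cdf_gap ** 2 * gaps)))


def two_support_loss(y, y_tilde, h=0.3, distance=cramer_stand_in):
    """Sum of distances computed on the predicted support and on the target support."""
    y, y_tilde = np.sort(y), np.sort(y_tilde)
    total = 0.0
    for support in (y, y_tilde):
        w = masses(support, y, h)        # predicted density discretized on this support
        v = masses(support, y_tilde, h)  # target density discretized on this support
        total += distance(support, w, v)
    return total


rng = np.random.default_rng(0)
pred = rng.normal(0.0, 1.0, size=64)    # stand-in for samples from the predicted flow
target = rng.normal(0.5, 1.0, size=64)  # stand-in for samples from the Bellman target

loss_same = two_support_loss(pred, pred)
loss_diff = two_support_loss(pred, target)
print(f"identical samples: {loss_same:.4f}   shifted target: {loss_diff:.4f}")
```

Passing the distance as an argument keeps the sampling and discretization steps separate from the choice of metric, so a faithful implementation of $D$ could be substituted without touching the rest.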