Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cram\'er Surrogate

Jesse Read; Marie-Paule Cani; Rim Kaddah; Simo Alami C.

arxiv: 2505.04310 · v2 · submitted 2025-05-07 · 💻 cs.AI · cs.LG· math.OC

Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cram\'er Surrogate

Simo Alami C. , Rim Kaddah , Jesse Read , Marie-Paule Cani This is my paper

Pith reviewed 2026-05-22 16:30 UTC · model grok-4.3

classification 💻 cs.AI cs.LGmath.OC

keywords distributional reinforcement learningnormalizing flowsCramér distanceparameter efficiencyAtari benchmarkBellman operatorreturn distributionsmulti-modal returns

0 comments

The pith

Normalizing flows with a geometry-aware Cramér distance enable parameter-efficient distributional reinforcement learning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops NFDRL to model full return distributions in reinforcement learning using continuous normalizing flows rather than discrete categorical or quantile representations. It introduces a geometry-aware Cramér distance over the flow-derived probability masses. This distance qualifies as a true probability metric, makes the distributional Bellman operator a contraction with factor sqrt(gamma), and supports unbiased sample gradients for training. The resulting model captures multi-modal and heavy-tailed returns adaptively without parameter counts scaling with resolution. Experiments show it recovers complex distributions on toy tasks and performs competitively with standard methods on Atari-5 while using substantially fewer parameters.

Core claim

By representing return distributions with continuous normalizing flows and training them via a geometry-aware Cramér surrogate on probability masses, the method achieves a true metric distance, a sqrt(gamma)-contraction for the Bellman operator, unbiased gradients, and a compact parameter footprint that does not increase with distribution resolution, allowing recovery of rich multi-modal returns competitive with categorical baselines on Atari-5.

What carries the argument

The geometry-aware Cramér distance defined over probability masses from the normalizing flow, which enables training of the continuous representation while guaranteeing metric properties and contraction behavior.

If this is right

Return distributions can be modeled with adaptive support without discretizing into fixed bins or quantiles.
The parameter count stays constant even as the effective resolution or complexity of the return distribution increases.
Unbiased gradients from the objective allow for stable end-to-end training of the flow model.
Performance matches categorical methods on Atari-5 while offering better parameter efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach may allow scaling distributional RL to settings where return distributions are continuous or highly complex without prohibitive parameter growth.
Future work could explore integrating the flow-based model with other RL components like actor-critic methods for end-to-end learning.
The contraction property suggests potential for theoretical analysis of convergence rates in flow-based distributional RL.

Load-bearing premise

Continuous normalizing flows can be trained to accurately capture the possibly multi-modal or heavy-tailed return distributions in complex MDPs such as Atari games without encountering instability or mode collapse.

What would settle it

A failure to recover multi-modal return distributions on toy MDPs or to achieve competitive scores on Atari-5 with the claimed parameter savings would indicate that the flow-based model does not deliver the promised advantages over discrete alternatives.

Figures

Figures reproduced from arXiv: 2505.04310 by Jesse Read, Marie-Paule Cani, Rim Kaddah, Simo Alami C..

**Figure 2.** Figure 2: Return distributions learnt for the final state of [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: NFDRL Parameter Efficiency. 5. Conclusion We introduced a new DistRL method that models return distributions as mixtures of Gaussians, with parameters learned via normalizing flows. By optimizing the Cramér loss—exactly or through a surrogate—we capture richer uncertainty and learn more precise value distributions. Empirically, our method achieves competitive or better performance on Atari games while bei… view at source ↗

read the original abstract

Distributional Reinforcement Learning (DistRL) improves upon expectation-based methods by modeling full return distributions, but standard approaches often remain far from parsimonious. Categorical methods (e.g., C51) rely on fixed supports where parameter counts scale linearly with resolution, while quantile methods approximate distributions as discrete mixtures whose piecewise-constant densities can be wasteful when modeling complex multi-modal or heavy-tailed returns. We introduce NFDRL, a parsimonious architecture that models return distributions using continuous normalizing flows. Unlike categorical baselines, our flow-based model maintains a compact parameter footprint that does not grow with the effective resolution of the distribution, while providing a dynamic, adaptive support for returns. To train this continuous representation, we propose a Cram\'er-inspired, geometry-aware distance defined over probability masses obtained from the flow. We show that this distance is a true probability metric, that the associated distributional Bellman operator is a sqrt(gamma)-contraction, and that the resulting objective admits unbiased sample gradients, properties that are typically not simultaneously guaranteed in prior PDF-based DistRL methods. Empirically, NFDRL recovers rich, multi-modal return landscapes on toy MDPs and achieves performance competitive with categorical baselines on the Atari-5 benchmark, while offering substantially better parameter efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NFDRL swaps categorical supports for continuous normalizing flows in distributional RL and adds a geometry-aware Cramér surrogate with claimed metric, contraction, and unbiased-gradient properties, but the practical training stability on shifting multi-modal returns is the part that needs checking.

read the letter

The main thing to know is that this paper replaces the usual fixed-support or quantile representations in distributional RL with a continuous normalizing flow. They define a geometry-aware Cramér distance over the flow's probability masses and assert that it is a true metric, that the associated Bellman operator contracts at rate sqrt(gamma), and that the objective yields unbiased sample gradients. Those three properties together are not common in prior PDF-based methods, so the combination is the actual novelty here. Empirically they recover multi-modal returns on toy MDPs and report competitive scores against categorical baselines on Atari-5 while using far fewer parameters that do not grow with resolution. That parameter-efficiency angle is the practical selling point and looks like a direct response to the scaling issues in C51-style approaches. The proofs and the flow architecture are the parts that feel like real work rather than incremental tuning. The soft spot is whether the continuous flow can be trained reliably on the kinds of non-stationary, possibly heavy-tailed return distributions that appear in Atari-scale problems. Flows are known to be sensitive to ODE tolerances, base-distribution choice, and mode coverage, and RL targets move as the policy improves. The abstract states that the method works without collapse on the tested cases, but any referee would want to see training curves, ablation on solver settings, and checks that the learned densities actually match the claimed multi-modality rather than smoothing over it. This is for people already working on parameter-efficient distributional methods or on alternatives to discrete supports. It has enough new architecture plus claimed theory to be worth a serious referee's time, even if the empirical section will probably need more controls and sensitivity analysis before acceptance.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces NFDRL, a distributional RL algorithm that represents return distributions via continuous normalizing flows rather than fixed-support categorical or quantile approximations. It defines a geometry-aware Cramér surrogate distance over probability masses extracted from the flow, asserts that this distance is a true metric, proves that the induced distributional Bellman operator is a sqrt(gamma)-contraction, and shows that the resulting training objective admits unbiased sample gradients. Toy-MDP experiments demonstrate recovery of multi-modal return landscapes; Atari-5 results are reported as competitive with categorical baselines while using substantially fewer parameters whose count does not scale with distributional resolution.

Significance. If the metric and contraction properties are rigorously established and the continuous flows can be trained stably on the multi-modal or heavy-tailed returns that arise in Atari-scale MDPs, the work would offer a principled route to parameter-efficient distributional RL. The simultaneous guarantees of metricity, contraction, and unbiased gradients address limitations that have persisted in prior density-based DistRL methods; the resolution-independent parameter footprint is a practical advantage if the empirical claims hold.

major comments (3)

[§4.2, Eq. (12)] §4.2, Eq. (12): the proof that the geometry-aware Cramér distance is a true probability metric relies on the flow producing well-defined probability masses; the manuscript must explicitly state how these masses are obtained from the continuous density (e.g., via quadrature or discretization) and verify that the resulting distance satisfies the triangle inequality without additional assumptions that may not hold for arbitrary flow architectures.
[§4.3, Theorem 1] §4.3, Theorem 1: the claimed sqrt(gamma)-contraction of the distributional Bellman operator is load-bearing for the convergence argument, yet the derivation appears to treat the flow parameters as fixed during the operator application; the manuscript should clarify whether the contraction still holds when the flow is updated concurrently with the policy, as is the case in the practical algorithm.
[§5.3, Table 3] §5.3, Table 3: the Atari-5 results report competitive scores with an order-of-magnitude reduction in parameters, but no ablation isolates the contribution of the geometry-aware surrogate versus standard flow training; without this, it remains unclear whether the performance gain is attributable to the proposed distance or to other implementation choices.

minor comments (3)

The abstract and §2 contain several LaTeX artifacts (e.g., “Cramér” rendered with backslash); these should be cleaned for readability.
Figure 2 (toy MDP return landscapes) would benefit from an additional panel showing the learned flow density overlaid on the empirical histogram to allow visual assessment of mode recovery.
The related-work discussion in §1.2 omits recent work on continuous normalizing flows for RL value functions; adding a brief comparison would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the changes incorporated into the revised manuscript.

read point-by-point responses

Referee: [§4.2, Eq. (12)] §4.2, Eq. (12): the proof that the geometry-aware Cramér distance is a true probability metric relies on the flow producing well-defined probability masses; the manuscript must explicitly state how these masses are obtained from the continuous density (e.g., via quadrature or discretization) and verify that the resulting distance satisfies the triangle inequality without additional assumptions that may not hold for arbitrary flow architectures.

Authors: We agree that the presentation of how probability masses are extracted requires explicit clarification. In the revised manuscript we have expanded §4.2 to state that masses are obtained by numerical quadrature of the flow density over a uniform discretization of the return support (with bin width chosen to match the effective resolution used in the Atari experiments). We further include a short appendix lemma showing that the geometry-aware surrogate inherits the triangle inequality from the classical Cramér distance on the resulting discrete measures; the argument relies only on non-negativity and integrability of the density, which hold for the continuous normalizing flows employed in the paper. revision: yes
Referee: [§4.3, Theorem 1] §4.3, Theorem 1: the claimed sqrt(gamma)-contraction of the distributional Bellman operator is load-bearing for the convergence argument, yet the derivation appears to treat the flow parameters as fixed during the operator application; the manuscript should clarify whether the contraction still holds when the flow is updated concurrently with the policy, as is the case in the practical algorithm.

Authors: The sqrt(gamma)-contraction is established for the distributional Bellman operator T acting on the space of probability measures equipped with the geometry-aware distance; the proof does not depend on the parameters being frozen. The flow parameters are optimized separately via stochastic gradient descent on the surrogate loss. We have added a clarifying paragraph immediately after Theorem 1 that distinguishes the contraction of the operator (which guarantees convergence of iterated application) from the practical parameter-update dynamics, noting that this separation is standard in analyses of parameterized distributional RL methods. revision: yes
Referee: [§5.3, Table 3] §5.3, Table 3: the Atari-5 results report competitive scores with an order-of-magnitude reduction in parameters, but no ablation isolates the contribution of the geometry-aware surrogate versus standard flow training; without this, it remains unclear whether the performance gain is attributable to the proposed distance or to other implementation choices.

Authors: We acknowledge that an explicit ablation comparing the geometry-aware surrogate against a generic flow training objective would strengthen the empirical section. However, defining a comparable “standard” flow loss that preserves unbiased gradients and metric properties is non-trivial and would require substantial additional implementation and compute. We have therefore expanded the discussion in §5.3 to link the observed multi-modal recovery on toy MDPs directly to the geometry-aware distance and to argue that the parameter-efficiency advantage is inseparable from the proposed surrogate. We leave a fuller ablation for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; theoretical claims derived independently from definitions

full rationale

The paper defines a geometry-aware Cramér distance over flow probability masses and derives its metric properties, the sqrt(gamma)-contraction of the distributional Bellman operator, and unbiased sample gradients as mathematical results from the flow representation and distance definition. These are presented as proven properties rather than fitted quantities or reductions to self-citations. Empirical results on toy MDPs and Atari-5 are framed as competitive performance validation, separate from the derivation. No step in the abstract or described claims reduces by construction to its inputs, self-citations, or renamed known results. The derivation chain remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

With only the abstract available, the ledger is necessarily incomplete. The paper appears to rely on standard properties of normalizing flows and contraction mappings in metric spaces, but no explicit free parameters, ad-hoc axioms, or new invented entities are named in the provided text.

pith-pipeline@v0.9.0 · 5770 in / 1284 out tokens · 41070 ms · 2026-05-22T16:30:44.772347+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

[1]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

work page 2018
[2]

Bellemare, Will Dabney, and Rémi Munos

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforce- mentlearning. InDoinaPrecupandYeeWhyeTeh,editors,Proceedingsofthe34thInternational Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 449–458. PMLR, 08 2017. URLhttps://proceedings.mlr.press/v70/bellemare17a.html

work page 2017
[3]

Implicit quantile networks for distributionalreinforcementlearning

Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributionalreinforcementlearning. InJenniferDyandAndreasKrause,editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1096–1105. PMLR, 07 2018. URLhttps://proceedings.mlr.press/ v8...

work page 2018
[4]

Searchonthereplaybuffer: Bridg- ing planning and reinforcement learning

BenEysenbach,RussRSalakhutdinov,andSergeyLevine. Searchonthereplaybuffer: Bridg- ing planning and reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, 10 F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Sys- tems, volume 32. Curran Associates, Inc., 2019. URLhttps://proceedings.neurips.cc/ paper_fi...

work page 2019
[5]

Reinforcementlearningineconomics and finance, 2020

ArthurCharpentier,RomualdElie,andCarlRemlinger. Reinforcementlearningineconomics and finance, 2020. URLhttps://arxiv.org/abs/2003.10014

work page arXiv 2020
[6]

Bellemare, and Rémi Munos

Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforce- ment learning with quantile regression. InAAAI, 2017

work page 2017
[7]

Fully pa- rameterized quantile function for distributional reinforcement learning

Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully pa- rameterized quantile function for distributional reinforcement learning. In H. Wal- lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URLhttps://...

work page 2019
[8]

Non-crossing quantile regression for distri- butional reinforcement learning

Fan Zhou, Jianing Wang, and Xingdong Feng. Non-crossing quantile regression for distri- butional reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 15909– 15919.CurranAssociates,Inc.,2020. URLhttps://proceedings.neurips.cc/paper_files/ paper...

work page 2020
[9]

Dis- tributional reinforcement learning with unconstrained monotonic neural networks.Neuro- computing, 534:199–219, May 2023

Thibaut Théate, Antoine Wehenkel, Adrien Bolland, Gilles Louppe, and Damien Ernst. Dis- tributional reinforcement learning with unconstrained monotonic neural networks.Neuro- computing, 534:199–219, May 2023. ISSN 0925-2312. doi: 10.1016/j.neucom.2023.02.049. URL http://dx.doi.org/10.1016/j.neucom.2023.02.049

work page doi:10.1016/j.neucom.2023.02.049 2023
[10]

Normalizing flows for probabilistic modeling and inference.Journal of Machine Learning Research, 22(57):1–64, 2021

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference.Journal of Machine Learning Research, 22(57):1–64, 2021. URLhttp://jmlr.org/papers/v22/19-1028. html

work page 2021
[11]

URLhttps://arxiv.org/abs/2210.02019

MatthewAitchison,PennySweetser,andMarcusHutter.Atari-5: Distillingthearcadelearning environment down to five games, 2022. URLhttps://arxiv.org/abs/2210.02019

work page arXiv 2022
[12]

Bellemare, Will Dabney, and Mark Rowland.Distributional Reinforcement Learning

Marc G. Bellemare, Will Dabney, and Mark Rowland.Distributional Reinforcement Learning. MIT Press, 2023.http://www.distributional-rl.org

work page 2023
[13]

Flow++: Improv- ingflow-basedgenerativemodelswithvariationaldequantizationandarchitecturedesign

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improv- ingflow-basedgenerativemodelswithvariationaldequantizationandarchitecturedesign. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2722–...

work page
[14]

The Cramer Distance as a Solution to Biased Wasserstein Gradients

Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The cramer distance as a solution to biased wasserstein gradients, 2017. URLhttps://arxiv.org/abs/1705.10743

work page internal anchor Pith review Pith/arXiv arXiv 2017
[15]

Dis- tributionalreinforcementlearningwithdualexpectile-quantileregression, 2024

Sami Jullien, Romain Deffayet, Jean-Michel Renders, Paul Groth, and Maarten de Rijke. Dis- tributionalreinforcementlearningwithdualexpectile-quantileregression, 2024. URLhttps: //arxiv.org/abs/2305.16877

work page arXiv 2024
[16]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

MarkTowers,ArielKwiatkowski,JordanTerry,JohnUBalis,GianlucaDeCola,TristanDeleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A stan- dardinterfaceforreinforcementlearningenvironments.arXivpreprintarXiv:2407.17032,2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Revisitingthearcadelearningenvironment: Evaluationprotocolsandopen problems for general agents.Journal of Artificial Intelligence Research, 61:523–562, 2018

MarlosC.Machado,MarcG.Bellemare,ErikTalvitie,JoelVeness,MatthewJ.Hausknecht,and MichaelBowling. Revisitingthearcadelearningenvironment: Evaluationprotocolsandopen problems for general agents.Journal of Artificial Intelligence Research, 61:523–562, 2018

work page 2018
[18]

Cleanrl: High-qualitysingle-fileimplementationsofdeep reinforcement learning algorithms.Journal of Machine Learning Research, 23(274):1–18, 2022

Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, KinalMehta,andJoãoG.M.Araújo. Cleanrl: High-qualitysingle-fileimplementationsofdeep reinforcement learning algorithms.Journal of Machine Learning Research, 23(274):1–18, 2022. URLhttp://jmlr.org/papers/v23/21-1342.html. 12 A. Limitations VarianceOur method exhibits high...

work page 2022
[19]

The training variance might not help the model converge faster

work page
[20]

Thisindirect relationship might hinder the learning performance by making the task more complex for the model

Instead of learning directly the density of given values like C51, or specified values, our modellearnsflowparametrisationsthatindirectlyleadtoreturndistributions. Thisindirect relationship might hinder the learning performance by making the task more complex for the model

work page
[21]

NormalizingFlowsareeffectiveforlearningexactlikelihoodsbuttheyarenotoriouslyslow to train, this fact is confirmed by our empirical results. CDF FlowWhile using a CDF as a flow transformation offers advantages in modeling monotonic mappings and enabling efficient computation of the Cramér distance (Main paper section 3.2), it also introduces notable limita...

work page
[22]

geometric weight

draws the following propositions: Proposition 1:The KL divergence has unbiased sample gradients (U), but is not scale sensitive (S). Proposition 2:The Wasserstein metric is ideal (I, S), but does not have unbiased sample gradients. E. Cramér-inspired geometry-aware metric We introduce a Cramér-inspired, geometry-aware metric on discrete probability masses...

work page
[23]

which bins carry mass (viaΩa,Ω b),

work page
[24]

near” from “far

not by how far apart those bins are. Concretely, on a grid{−10,−9, . . . ,10}, one can compute Ω−10 = Ω+10 = 210,Ω −9 = 191, so that D2(δ−10, δ−9)∝210 + 191 = 401, D 2(δ−10, δ+10)∝210 + 210 = 420, which are very close despite the spikes being at distance1vs20. In other words: For disjoint one-hot distributions, the exact metricDbehaves as a geometry-weigh...

work page
[25]

, y(N) ∼p,˜y (1),

Sample from the continuous densities: y(1), . . . , y(N) ∼p,˜y (1), . . . ,˜y(M) ∼q

work page
[26]

Estimatepandqvia KDE on each support using a kernelKh with bandwidthh >0: on the predicted support{yi}N i=1, ˆp(yi) = 1 N NX k=1 Kh(yi −y (k)), ˆq(yi) = 1 M MX j=1 Kh(yi −˜y(j)), on the target support{˜yj}M j=1, ˆp(˜yj) = 1 N NX k=1 Kh(˜yj −y (k)), ˆq(˜yj) = 1 M MX j=1 Kh(˜yj −˜y(j))

work page
[27]

twospikes

Discretize these KDEs into mass vectors on each grid (e.g. by Riemann approximation): w(y) i ≈ ˆp(yi) ∆yP k ˆp(yk) ∆y, v (y) i ≈ ˆq(yi) ∆yP k ˆq(yk) ∆y, and similarlyw(˜y), v(˜y)on{˜yj}. The practical loss we use (Eq. (11) in the main text) is then L(ηπ(x, a), T πη(x, a)) =D w(y), v(y) +D w(˜y), v(˜y) ,(28) whereDis exactly the metric defined in (25), app...

work page arXiv

[1] [1]

MIT press, 2018

Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018

work page 2018

[2] [2]

Bellemare, Will Dabney, and Rémi Munos

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforce- mentlearning. InDoinaPrecupandYeeWhyeTeh,editors,Proceedingsofthe34thInternational Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 449–458. PMLR, 08 2017. URLhttps://proceedings.mlr.press/v70/bellemare17a.html

work page 2017

[3] [3]

Implicit quantile networks for distributionalreinforcementlearning

Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit quantile networks for distributionalreinforcementlearning. InJenniferDyandAndreasKrause,editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1096–1105. PMLR, 07 2018. URLhttps://proceedings.mlr.press/ v8...

work page 2018

[4] [4]

Searchonthereplaybuffer: Bridg- ing planning and reinforcement learning

BenEysenbach,RussRSalakhutdinov,andSergeyLevine. Searchonthereplaybuffer: Bridg- ing planning and reinforcement learning. In H. Wallach, H. Larochelle, A. Beygelzimer, 10 F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Sys- tems, volume 32. Curran Associates, Inc., 2019. URLhttps://proceedings.neurips.cc/ paper_fi...

work page 2019

[5] [5]

Reinforcementlearningineconomics and finance, 2020

ArthurCharpentier,RomualdElie,andCarlRemlinger. Reinforcementlearningineconomics and finance, 2020. URLhttps://arxiv.org/abs/2003.10014

work page arXiv 2020

[6] [6]

Bellemare, and Rémi Munos

Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforce- ment learning with quantile regression. InAAAI, 2017

work page 2017

[7] [7]

Fully pa- rameterized quantile function for distributional reinforcement learning

Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully pa- rameterized quantile function for distributional reinforcement learning. In H. Wal- lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URLhttps://...

work page 2019

[8] [8]

Non-crossing quantile regression for distri- butional reinforcement learning

Fan Zhou, Jianing Wang, and Xingdong Feng. Non-crossing quantile regression for distri- butional reinforcement learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 15909– 15919.CurranAssociates,Inc.,2020. URLhttps://proceedings.neurips.cc/paper_files/ paper...

work page 2020

[9] [9]

Dis- tributional reinforcement learning with unconstrained monotonic neural networks.Neuro- computing, 534:199–219, May 2023

Thibaut Théate, Antoine Wehenkel, Adrien Bolland, Gilles Louppe, and Damien Ernst. Dis- tributional reinforcement learning with unconstrained monotonic neural networks.Neuro- computing, 534:199–219, May 2023. ISSN 0925-2312. doi: 10.1016/j.neucom.2023.02.049. URL http://dx.doi.org/10.1016/j.neucom.2023.02.049

work page doi:10.1016/j.neucom.2023.02.049 2023

[10] [10]

Normalizing flows for probabilistic modeling and inference.Journal of Machine Learning Research, 22(57):1–64, 2021

George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference.Journal of Machine Learning Research, 22(57):1–64, 2021. URLhttp://jmlr.org/papers/v22/19-1028. html

work page 2021

[11] [11]

URLhttps://arxiv.org/abs/2210.02019

MatthewAitchison,PennySweetser,andMarcusHutter.Atari-5: Distillingthearcadelearning environment down to five games, 2022. URLhttps://arxiv.org/abs/2210.02019

work page arXiv 2022

[12] [12]

Bellemare, Will Dabney, and Mark Rowland.Distributional Reinforcement Learning

Marc G. Bellemare, Will Dabney, and Mark Rowland.Distributional Reinforcement Learning. MIT Press, 2023.http://www.distributional-rl.org

work page 2023

[13] [13]

Flow++: Improv- ingflow-basedgenerativemodelswithvariationaldequantizationandarchitecturedesign

Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improv- ingflow-basedgenerativemodelswithvariationaldequantizationandarchitecturedesign. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2722–...

work page

[14] [14]

The Cramer Distance as a Solution to Biased Wasserstein Gradients

Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The cramer distance as a solution to biased wasserstein gradients, 2017. URLhttps://arxiv.org/abs/1705.10743

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

Dis- tributionalreinforcementlearningwithdualexpectile-quantileregression, 2024

Sami Jullien, Romain Deffayet, Jean-Michel Renders, Paul Groth, and Maarten de Rijke. Dis- tributionalreinforcementlearningwithdualexpectile-quantileregression, 2024. URLhttps: //arxiv.org/abs/2305.16877

work page arXiv 2024

[16] [16]

Gymnasium: A Standard Interface for Reinforcement Learning Environments

MarkTowers,ArielKwiatkowski,JordanTerry,JohnUBalis,GianlucaDeCola,TristanDeleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A stan- dardinterfaceforreinforcementlearningenvironments.arXivpreprintarXiv:2407.17032,2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

Revisitingthearcadelearningenvironment: Evaluationprotocolsandopen problems for general agents.Journal of Artificial Intelligence Research, 61:523–562, 2018

MarlosC.Machado,MarcG.Bellemare,ErikTalvitie,JoelVeness,MatthewJ.Hausknecht,and MichaelBowling. Revisitingthearcadelearningenvironment: Evaluationprotocolsandopen problems for general agents.Journal of Artificial Intelligence Research, 61:523–562, 2018

work page 2018

[18] [18]

Cleanrl: High-qualitysingle-fileimplementationsofdeep reinforcement learning algorithms.Journal of Machine Learning Research, 23(274):1–18, 2022

Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, KinalMehta,andJoãoG.M.Araújo. Cleanrl: High-qualitysingle-fileimplementationsofdeep reinforcement learning algorithms.Journal of Machine Learning Research, 23(274):1–18, 2022. URLhttp://jmlr.org/papers/v23/21-1342.html. 12 A. Limitations VarianceOur method exhibits high...

work page 2022

[19] [19]

The training variance might not help the model converge faster

work page

[20] [20]

Thisindirect relationship might hinder the learning performance by making the task more complex for the model

Instead of learning directly the density of given values like C51, or specified values, our modellearnsflowparametrisationsthatindirectlyleadtoreturndistributions. Thisindirect relationship might hinder the learning performance by making the task more complex for the model

work page

[21] [21]

NormalizingFlowsareeffectiveforlearningexactlikelihoodsbuttheyarenotoriouslyslow to train, this fact is confirmed by our empirical results. CDF FlowWhile using a CDF as a flow transformation offers advantages in modeling monotonic mappings and enabling efficient computation of the Cramér distance (Main paper section 3.2), it also introduces notable limita...

work page

[22] [22]

geometric weight

draws the following propositions: Proposition 1:The KL divergence has unbiased sample gradients (U), but is not scale sensitive (S). Proposition 2:The Wasserstein metric is ideal (I, S), but does not have unbiased sample gradients. E. Cramér-inspired geometry-aware metric We introduce a Cramér-inspired, geometry-aware metric on discrete probability masses...

work page

[23] [23]

which bins carry mass (viaΩa,Ω b),

work page

[24] [24]

near” from “far

not by how far apart those bins are. Concretely, on a grid{−10,−9, . . . ,10}, one can compute Ω−10 = Ω+10 = 210,Ω −9 = 191, so that D2(δ−10, δ−9)∝210 + 191 = 401, D 2(δ−10, δ+10)∝210 + 210 = 420, which are very close despite the spikes being at distance1vs20. In other words: For disjoint one-hot distributions, the exact metricDbehaves as a geometry-weigh...

work page

[25] [25]

, y(N) ∼p,˜y (1),

Sample from the continuous densities: y(1), . . . , y(N) ∼p,˜y (1), . . . ,˜y(M) ∼q

work page

[26] [26]

Estimatepandqvia KDE on each support using a kernelKh with bandwidthh >0: on the predicted support{yi}N i=1, ˆp(yi) = 1 N NX k=1 Kh(yi −y (k)), ˆq(yi) = 1 M MX j=1 Kh(yi −˜y(j)), on the target support{˜yj}M j=1, ˆp(˜yj) = 1 N NX k=1 Kh(˜yj −y (k)), ˆq(˜yj) = 1 M MX j=1 Kh(˜yj −˜y(j))

work page

[27] [27]

twospikes

Discretize these KDEs into mass vectors on each grid (e.g. by Riemann approximation): w(y) i ≈ ˆp(yi) ∆yP k ˆp(yk) ∆y, v (y) i ≈ ˆq(yi) ∆yP k ˆq(yk) ∆y, and similarlyw(˜y), v(˜y)on{˜yj}. The practical loss we use (Eq. (11) in the main text) is then L(ηπ(x, a), T πη(x, a)) =D w(y), v(y) +D w(˜y), v(˜y) ,(28) whereDis exactly the metric defined in (25), app...

work page arXiv