arxiv: 2604.20381 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.NE· cs.RO

Recognition: unknown

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

Behrad Koohy, Jamie Bayne

Authors on Pith no claims yet

Pith reviewed 2026-05-10 00:07 UTC · model grok-4.3

classification 💻 cs.LG cs.NEcs.RO

keywords quality-diversityreinforcement learningdistributional rltarget-freehigh utdsample efficiencybraxdominated novelty search

0 comments

The pith

Target-free distributional critics enable stable high-UTD training in quality-diversity reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QDHUAC, a quality-diversity reinforcement learning algorithm that replaces target networks with distributional value estimation to generate dense low-variance gradients. These gradients support training at high update-to-data ratios inside a dominance-based novelty search procedure. The method reaches competitive coverage and fitness on high-dimensional Brax locomotion tasks while using roughly an order of magnitude fewer environment steps than prior approaches. A sympathetic reader would care because conventional QD algorithms have long been limited by poor sample efficiency, often requiring tens of millions of interactions before producing useful skill repertoires. The central advance is showing that the combination of distributional critics and dominance selection can remove the computational overhead of target networks without sacrificing training stability.

Core claim

QDHUAC is a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines.

What carries the argument

Target-free distributional critic paired with Dominated Novelty Search, supplying stable low-variance gradients for high-UTD actor-critic updates across the QD population.

Load-bearing premise

Distributional value estimation without target networks supplies sufficiently stable and low-variance gradients to support high-UTD training inside Dominated Novelty Search without introducing instability or bias in the QD population.

What would settle it

Observing gradient instability, training divergence, or substantially lower QD coverage and fitness when the target-free distributional critic is used at high UTD ratios on the same Brax environments.

Figures

Figures reproduced from arXiv: 2604.20381 by Behrad Koohy, Jamie Bayne.

**Figure 1.** Figure 1: The Optimisation Loop of QDHUAC. The algorithm maintains a shared Replay Buffer D populated by the evolutionary search. A global Target-Free Distributional Critic learns continuously from this diverse data to guide a gradient-based policy, which accelerates the discovery of high-performing solutions. To stabilise learning at high Update-to-Data (UTD) ratios without a target network, the critic employs comp… view at source ↗

**Figure 2.** Figure 2: Benchmarking Sample Efficiency and Performance. We compare QDHUAC (Ours) against baselines across five continuous control locomotion tasks. QDHUAC demonstrates improvements in sample efficiency, achieving higher scores earlier than baselines in complex environments such as Humanoid and Ant. Coverage is measured as % of projected archive covered, as seen in Bahlous-Boldi et al. [1]. batch and raw environmen… view at source ↗

**Figure 4.** Figure 4: Scaling Robustness and the Role of Hybrid Normalisation. We evaluate the stability of our critic under increasing Update-to-Data (UTD) ratios, ranging from 5𝑘 to 40𝑘 updates per iteration on HalfCheetah (𝑁 = 5). While the standard unnormalised critic suffers from performance degradation at high UTD ratios, losing ≈ 30% of its peak fitness as updates increase, the batch and weight normalised layers signifi… view at source ↗

**Figure 5.** Figure 5: Ablation of Critic Normalisation Components. Relative performance of normalisation schemes on the Humanoid environment, normalised against the full QDHUAC architecture (BN + WN, solid blue line). While coverage is minimally affected, combining both Batch and Weight Normalisation is uniquely required to prevent long-term value divergence and maintain peak maximum fitness. effect on overall descriptor Cov… view at source ↗

**Figure 6.** Figure 6: Evaluating Archive Comparability. Relative performance of a structured MAP-Elites archive compared against the unstructured DNS archive (normalized to a baseline of 1.0, solid blue line) within the QDHUAC framework. Shaded regions represent the standard deviation across 5 independent seeds. The lack of significant divergence in Max Fitness confirms that the sample efficiency gains of QDHUAC are robust to t… view at source ↗

read the original abstract

Quality-Diversity (QD) algorithms excel at discovering diverse repertoires of skills, but are hindered by poor sample efficiency and often require tens of millions of environment steps to solve complex locomotion tasks. Recent advances in Reinforcement Learning (RL) have shown that high Update-to-Data (UTD) ratios accelerate Actor-Critic learning. While effective, standard high-UTD algorithms typically utilise target networks to stabilise training. This requirement introduces a significant computational bottleneck, rendering them impractical for resource-intensive Quality-Diversity (QD) tasks where sample efficiency and rapid population adaptation are critical. In this paper, we introduce QDHUAC, a sample-efficient, target-free and distributional QD-RL algorithm that provides dense and low-variance gradient signals, which enables high-UTD training for Dominated Novelty Search whilst requiring an order of magnitude fewer environment steps. We demonstrate that our method enables stable training at high UTD ratios, achieving competitive coverage and fitness on high-dimensional Brax environments with an order of magnitude fewer samples than baselines. Our results suggest that combining target-free distributional critics with dominance-based selection is a key enabler for the next generation of sample-efficient evolutionary RL algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QDHUAC claims target-free distributional critics enable stable high-UTD training inside Dominated Novelty Search for big sample savings in QD-RL, but the abstract supplies no ablations or stats to support that attribution.

read the letter

The core idea is that distributional value estimates without target networks give low-variance gradients that let you run high update-to-data ratios inside a dominance-based QD loop, cutting environment steps by roughly 10x on Brax locomotion tasks while holding coverage and fitness. That specific combination of target-free distributional critics plus Dominated Novelty Search is the new piece relative to prior QD-RL work. High-UTD training and distributional RL are both established elsewhere, but their integration here for population-based skill discovery looks fresh on the surface. The motivation is also solid: QD algorithms already struggle with sample cost, and target networks add extra overhead when you have to maintain a whole population. Removing them could matter in practice for robotics-style tasks. The main weakness is the missing evidence. The abstract states competitive results with an order-of-magnitude improvement but shows no UTD ratios, no baseline descriptions, no run counts, no statistical tests, and no ablation that holds the dominance selection fixed while swapping in a standard TD critic. The stress-test concern lands: dominance filtering on fitness and novelty already prunes noisy updates, so it is possible the selection mechanism, not the distributional critic, is doing most of the stabilization. Without that isolation, the headline claim about distributional estimation remains unproven. This paper is aimed at people working on sample-efficient QD-RL or evolutionary RL. A reader already running Brax experiments might pick up the algorithmic sketch and try the combination, but only after the full paper supplies the controls. I would send it to peer review. The problem is real and the proposed fix is concrete enough to deserve referee time, even if the current draft needs substantial experimental strengthening to be convincing.

Referee Report

2 major / 2 minor

Summary. The paper introduces QDHUAC, a target-free distributional Quality-Diversity RL algorithm that combines distributional value estimation with Dominated Novelty Search to enable stable high-UTD training. It claims this yields competitive coverage and fitness on high-dimensional Brax environments using an order of magnitude fewer environment steps than baselines, with the combination of target-free distributional critics and dominance-based selection as the key enabler.

Significance. If the results hold after proper validation, the approach could improve sample efficiency in QD-RL by removing target network overhead and supporting higher UTD ratios, with potential to scale evolutionary RL methods. The work highlights a promising direction for robust gradient signals in population-based algorithms, but the attribution to distributional estimation specifically requires stronger evidence.

major comments (2)

[Experimental Results / Ablation Studies] The central claim that distributional value estimation without target networks supplies sufficiently stable and low-variance gradients for high-UTD training inside Dominated Novelty Search is load-bearing, yet no ablation is described that isolates this component (e.g., standard TD critic without targets at identical UTD ratios under the same dominance selection). Dominated Novelty Search's Pareto filtering may itself reduce effective variance, undermining the headline attribution.
[Abstract and §4 (Experiments)] The abstract asserts empirical success on Brax tasks with specific performance gains, but the manuscript supplies no experimental details, baseline descriptions, statistical tests, ablation results, or hyperparameter settings for the high-UTD regime. This prevents verification of the stability claim and the 'order of magnitude fewer samples' result.

minor comments (2)

[Method Description] Clarify the exact UTD ratios tested and how they compare to prior high-UTD RL methods that rely on targets.
[§3 (Algorithm)] The notation for the distributional critic and dominance-based selection could be made more explicit with pseudocode or equations to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address the major comments point by point below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: The central claim that distributional value estimation without target networks supplies sufficiently stable and low-variance gradients for high-UTD training inside Dominated Novelty Search is load-bearing, yet no ablation is described that isolates this component (e.g., standard TD critic without targets at identical UTD ratios under the same dominance selection). Dominated Novelty Search's Pareto filtering may itself reduce effective variance, undermining the headline attribution.

Authors: We agree that an ablation isolating the distributional critic from a standard TD critic (both without target networks) at identical high UTD ratios and under the same Dominated Novelty Search selection would provide stronger evidence for the specific contribution of distributional estimation. While the current results demonstrate stable high-UTD training with the integrated method, we will add this ablation study in the revised manuscript to directly address the potential confounding role of Pareto filtering and to clarify the source of gradient stability. revision: yes
Referee: The abstract asserts empirical success on Brax tasks with specific performance gains, but the manuscript supplies no experimental details, baseline descriptions, statistical tests, ablation results, or hyperparameter settings for the high-UTD regime. This prevents verification of the stability claim and the 'order of magnitude fewer samples' result.

Authors: We acknowledge that the experimental section would benefit from greater detail to support reproducibility and verification of the claims. The manuscript contains baseline descriptions and high-level experimental setup in Section 4, but we will expand it to include full hyperparameter tables for the high-UTD regime, statistical reporting (means and standard deviations across multiple seeds), additional ablation results, and explicit environment-step comparisons that substantiate the reported sample-efficiency gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical algorithmic proposal without self-referential derivations

full rationale

The paper presents QDHUAC as a novel combination of target-free distributional critics and Dominated Novelty Search to enable high-UTD training. No equations, parameter fittings, uniqueness theorems, or derivation chains are described that reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. Claims rest on empirical results for coverage and fitness on Brax environments rather than any closed-loop mathematical reduction. This is a standard non-circular presentation of an algorithmic method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5508 in / 1008 out tokens · 34226 ms · 2026-05-10T00:07:46.924815+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Ryan Bahlous-Boldi, Maxence Faldor, Luca Grillotti, Hannah Janmohamed, Lisa Coiffard, Lee Spector, and Antoine Cully. 2025. Dominated novelty search: Rethinking local competition in quality-diversity. InProceedings of the Genetic and Evolutionary Computation Conference. 104–112

2025
[2]

David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017. The shattered gradients problem: If resnets are the answer, then what is the question?. InInternational conference on machine learning. PMLR, 342–350

2017
[3]

Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspec- tive on reinforcement learning. InInternational conference on machine learning. PMLR, 449–458

2017
[4]

2023.Distributional rein- forcement learning

Marc G Bellemare, Will Dabney, and Mark Rowland. 2023.Distributional rein- forcement learning. MIT Press

2023
[5]

Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Ami- ranashvili, Thomas Brox, and Jan Peters. 2019. Crossq: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity.arXiv preprint arXiv:1902.05605(2019)

work page arXiv 2019
[6]

Felix Chalumeau, Bryan Lim, Raphael Boige, Maxime Allard, Luca Grillotti, Manon Flageat, Valentin Macé, Guillaume Richard, Arthur Flajolet, Thomas Pierrot, et al. 2024. QDax: A library for quality-diversity and population-based algorithms with hardware acceleration.Journal of Machine Learning Research25, 108 (2024), 1–16

2024
[7]

Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. 2021. Randomized ensembled double q-learning: Learning fast without a model.arXiv preprint arXiv:2101.05982(2021)

work page arXiv 2021
[8]

Tyler Clark, Mark Towers, Christine Evers, and Jonathon Hare. 2024. Beyond the rainbow: High performance deep reinforcement learning on a desktop pc. arXiv preprint arXiv:2411.03820(2024)

work page arXiv 2024
[9]

Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. 2018. Distri- butional reinforcement learning with quantile regression. InProceedings of the AAAI conference on artificial intelligence, Vol. 32

2018
[10]

Pierluca D’Oro, Max Schwarzer, Evgenii Nikishin, Pierre-Luc Bacon, Marc G Bellemare, and Aaron Courville. 2022. Sample-efficient reinforcement learning by breaking the replay ratio barrier. InDeep Reinforcement Learning Workshop NeurIPS 2022

2022
[11]

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. 2018. Diversity is all you need: Learning skills without a reward function.arXiv preprint arXiv:1802.06070(2018)

work page arXiv 2018
[12]

Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al
[13]

Stop regressing: Training value functions via classification for scalable deep rl.arXiv preprint arXiv:2403.03950(2024)

work page arXiv 2024
[14]

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. 2020. Revisiting fundamentals of experience replay. InInternational conference on machine learning. PMLR, 3061–3071

2020
[15]

C Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. 2021. Brax–a differentiable physics engine for large scale rigid body simulation.arXiv preprint arXiv:2106.13281(2021)

work page arXiv 2021
[16]

Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing function ap- proximation error in actor-critic methods. InInternational conference on machine learning. PMLR, 1587–1596

2018
[17]

Florin Gogianu, Tudor Berariu, Mihaela C Rosca, Claudia Clopath, Lucian Buso- niu, and Razvan Pascanu. 2021. Spectral normalisation for deep reinforcement learning: an optimisation perspective. InInternational Conference on Machine Learning. PMLR, 3734–3744

2021
[18]

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans.Advances in neural information processing systems30 (2017)

2017
[19]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning. Pmlr, 1861–1870

2018
[20]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. InEuropean conference on computer vision. Springer, 630–645

2016
[21]

Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostro- vski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver
[22]

In Proceedings of the AAAI conference on artificial intelligence, Vol

Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32
[23]

Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshi- masa Tsuruoka. 2021. Dropout q-functions for doubly efficient reinforcement learning.arXiv preprint arXiv:2110.02034(2021)

work page arXiv 2021
[24]

Marcel Hussing, Claas Voelcker, Igor Gilitschenski, Amir-massoud Farahmand, and Eric Eaton. 2024. Dissecting deep rl with high update ratios: Combatting value divergence.arXiv preprint arXiv:2403.05996(2024)

work page arXiv 2024
[25]

Sergey Ioffe. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift.arXiv preprint arXiv:1502.03167(2015)

work page internal anchor Pith review arXiv 2015
[26]

Hannah Janmohamed and Antoine Cully. 2025. Multi-objective quality-diversity in unstructured and unbounded spaces. InProceedings of the Genetic and Evolu- tionary Computation Conference. 149–157

2025
[27]

Bryan Lim, Manon Flageat, and Antoine Cully. 2023. Understanding the synergies between quality-diversity and deep reinforcement learning. InProceedings of the Genetic and Evolutionary Computation Conference. 1212–1220

2023
[28]

Clare Lyle, Marc G Bellemare, and Pablo Samuel Castro. 2019. A comparative analysis of expected and distributional reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4504–4511

2019
[29]

Guozheng Ma, Lu Li, Sen Zhang, Zixuan Liu, Zhen Wang, Yixin Chen, Li Shen, Xueqian Wang, and Dacheng Tao. 2023. Revisiting plasticity in visual reinforce- ment learning: Data, modules and training stages.arXiv preprint arXiv:2310.07418 (2023)

work page arXiv 2023
[30]

Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites.arXiv preprint arXiv:1504.04909(2015)

work page Pith review arXiv 2015
[31]

Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. 2024. Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control.Advances in neural information processing systems37 (2024), 113038–113071

2024
[32]

Evgenii Nikishin, Max Schwarzer, Pierluca D’Oro, Pierre-Luc Bacon, and Aaron Courville. 2022. The primacy bias in deep reinforcement learning. InInternational conference on machine learning. PMLR, 16828–16847

2022
[33]

Olle Nilsson and Antoine Cully. 2021. Policy gradient assisted map-elites. In Proceedings of the Genetic and Evolutionary Computation Conference. 866–875

2021
[34]

Eleni Nisioti, Erwan Plantec, Milton Montero, Joachim Pedersen, and Sebastian Risi. 2025. When Does Neuroevolution Outcompete Reinforcement Learning in Transfer Learning Tasks?. InProceedings of the Genetic and Evolutionary Compu- tation Conference. 48–57

2025
[35]

Thomas Pierrot, Valentin Macé, Felix Chalumeau, Arthur Flajolet, Geoffrey Cideron, Karim Beguir, Antoine Cully, Olivier Sigaud, and Nicolas Perrin-Gilbert
[36]

InProceedings of the Genetic and Evolutionary Computation Conference

Diversity policy gradient for sample efficient quality-diversity optimiza- tion. InProceedings of the Genetic and Evolutionary Computation Conference. 1075–1083
[37]

Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. 2016. Quality diversity: A new frontier for evolutionary computation.Frontiers in Robotics and AI3 (2016), 40

2016
[38]

Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, and Andrew D Bagdanov
[39]

SPEQ: Offline Stabilization Phases for Efficient Q-Learning in High Update- To-Data Ratio Reinforcement Learning.arXiv preprint arXiv:2501.08669(2025)

work page arXiv 2025
[40]

Tim Salimans and Durk P Kingma. 2016. Weight normalization: A simple repa- rameterization to accelerate training of deep neural networks.Advances in neural information processing systems29 (2016)

2016
[41]

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. How does batch normalization help optimization?Advances in neural information processing systems31 (2018)

2018
[42]

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized experience replay.arXiv preprint arXiv:1511.05952(2015)

work page arXiv 2015
[43]

Max Schwarzer, Johan Samir Obando Ceron, Aaron Courville, Marc G Bellemare, Rishabh Agarwal, and Pablo Samuel Castro. 2023. Bigger, better, faster: Human- level atari with human-level efficiency. InInternational Conference on Machine Learning. PMLR, 30365–30380

2023
[44]

Samarth Sinha, Homanga Bharadhwaj, Aravind Srinivas, and Animesh Garg
[45]

D2rl: Deep dense architectures in reinforcement learning.arXiv preprint arXiv:2010.09163(2020)

work page arXiv 2010
[46]

Chen Tessler, Guy Tennenholtz, and Shie Mannor. 2019. Distributional policy optimization: An alternative approach for continuous control.Advances in Neural Information Processing Systems32 (2019)

2019
[47]

Vassilis Vassiliades, Konstantinos Chatzilygeroudis, and Jean-Baptiste Mouret
[48]

Using centroidal voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm.IEEE Transactions on Evolutionary Com- putation22, 4 (2017), 623–630

2017
[49]

Ke Xue, Ren-Jian Wang, Pengyi Li, Dong Li, Jianye Hao, and Chao Qian. 2024. Sample-efficient quality-diversity by cooperative coevolution. InThe Twelfth International Conference on Learning Representations. GECCO ’26, July 13–17, 2026, San José, Costa Rica Koohy et al. Appendix 6.1 Hyperparameters for QDHUAC Table 1:Hyperparameters for QDHUAC. We separate...

2024