arxiv: 2605.10734 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

XQCfD: Accelerating Fast Actor-Critic Algorithms with Prior Data and Prior Policies

Daniel Palenicek , Florian Vogt , Joe Watson , Ingmar Posner , Danica Kragic , Jan Peters

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:52 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learningactor criticdemonstrationsrobotic manipulationsample efficiencypretrained policiesstationary networkssparse rewards

0 comments

The pith

XQCfD uses stationary networks and augmented buffers to retain and improve upon pretrained policies in actor-critic learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing actor-critic methods tend to overwrite useful pretrained policies when they start learning from new experience mixed with demonstrations. XQCfD addresses this by extending the XQC algorithm with augmented replay buffers that combine prior and online data, and by employing stationary policy networks that keep higher entropy in their predictions for states outside the current training distribution. This design supports continued policy improvement without relying on ensembles or frequent updates. The result is state-of-the-art performance on several robotic manipulation benchmarks that feature sparse rewards. A sympathetic reader would see this as evidence that careful architectural choices can make better use of expensive demonstration data in real-world settings.

Core claim

The central discovery is that a stationary policy architecture combined with augmented replay buffers allows the XQC actor-critic to avoid rapidly unlearning strong initial policies from demonstrations. Instead, the higher entropy predictions enable effective policy improvement on out-of-distribution states, producing state-of-the-art results across complex manipulation tasks on the Adroit, Robomimic, and MimicGen benchmarks with a low update-to-data ratio and no ensemble networks.

What carries the argument

Stationary policy network architecture that generates higher-entropy predictions out of distribution to support ongoing improvement from pretrained policies.

If this is right

Pretrained policies can be retained and refined using demonstration data without special stabilization techniques beyond the stationary design.
Robotic agents achieve higher sample efficiency in sparse-reward settings by mixing prior data with new interactions.
Performance gains occur without increasing the update-to-data ratio or adding network ensembles.
The method applies directly to popular benchmarks for dexterous manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other fast actor-critic variants might benefit from similar stationary designs to preserve prior knowledge.
This approach could lower the barrier to using demonstration data in online reinforcement learning by reducing the risk of catastrophic forgetting.
Future work might test whether the higher entropy property holds across different task distributions beyond manipulation.

Load-bearing premise

The assumption that standard network architectures inherently lose high entropy predictions out of distribution, and that making them stationary will fix this without new instabilities.

What would settle it

Running the method on the Adroit benchmark and finding that the learned policy entropy matches that of non-stationary networks on out-of-distribution states, or that final performance does not exceed prior actor-critic baselines.

Figures

Figures reproduced from arXiv: 2605.10734 by Danica Kragic, Daniel Palenicek, Florian Vogt, Ingmar Posner, Jan Peters, Joe Watson.

**Figure 2.** Figure 2: Performance of XQCfD and baselines on Adroit over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. BC pretraining performance is shown before the vertical dashed line. Notably, in two of the tasks, BC achieves expert performance. The benefit of improving upon BC is clear in this scenario, and baselines then require 100K–1M interactions to recover BC-level pe… view at source ↗

**Figure 3.** Figure 3: Performance results on Robomimic over 10 seeds showing the IQM and 10th and 90th [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Performance results of XQCfD and baselines on MimicGen over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. BC pretraining performance is shown before the vertical dashed line. OD refers to adding expert offline data. Compared to Robomimic, these tasks are slightly more complex due to a broader initial state distribution and longer task horizon. For this su… view at source ↗

**Figure 5.** Figure 5: An empirical analysis of the optimization landscape of the actor with and without the [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation studies on XQCfD, replacing the HetStat MLP with a standard MLP and replacing KL regularization with standard entropy regularization for the Adroit (top), Robomimic (middle) and MimicGen (bottom) environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Sensitivity study of XQCfD’s temperature α that controls KL regularization against the BC policy. Evaluated over one task for Adroit, Robomimic and MimicGen over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. Lower temperatures result in a larger performance drop on the transition from BC to RL due to unlearning, but lower temperatures also facilitate grea… view at source ↗

**Figure 8.** Figure 8: Sensitivity study on XQCfD, replacing the number of expert demonstrations for the Adroit environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. 0.0 0.25 0.5 0.75 1.0 Environment Steps (1M) 0.0 0.5 1.0 Success Lift 0.0 0.25 0.5 0.75 1.0 Environment Steps (1M) PickPlaceCan 0.0 0.25 0.5 0.75 1.0 Environment Steps (1M) NutAssemblySquare N=200 N=100 N… view at source ↗

**Figure 9.** Figure 9: Sensitivity study on XQCfD, replacing the number of expert demonstrations for the Robomimic environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. 0.0 0.25 0.5 0.75 1.0 Environment Steps (1M) 0.0 0.5 1.0 Success StackThree D0 0.0 0.25 0.5 0.75 1.0 Environment Steps (1M) Coffee D0 0.0 0.25 0.5 0.75 1.0 Environment Steps (1M) HammerCleanup D0 0.0 0… view at source ↗

**Figure 10.** Figure 10: Sensitivity study on XQCfD, replacing the number of expert demonstrations for the MimicGen environments over 10 seeds, showing the IQM and 10th and 90th percentile stratified bootstrap confidence intervals. B The Categorical Critic Objective In C51 [3], the critic models the return distribution as a categorical distribution over N fixed atoms {zi} N i=1 with probabilities pϕ(s, a) = (pϕ,1(s, a), . . . , p… view at source ↗

read the original abstract

For reinforcement learning in the real world online exploration is expensive A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency Expert demonstration data is often crucial for solving hard exploration tasks with sparse rewards While prior data is used to augment experience and pretrain models we show that the design of existing algorithms fails to achieve the sample efficiency that is possible in this setting due to a failure to use pretrained policies effectively We propose XQCfD which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers pretrained policies and stationary policy architectures designed to avoid rapidly unlearning the strong initial policy like prior works We show our stationary network architecture enables policy improvement out-of-distribution better than standard network architectures due to its higher entropy predictions XQCfD achieves state of the art performance across a range of complex manipulation tasks with sparse rewards from the popular Adroit Robomimic and MimicGen benchmarks -- notably with a low update-to-data ratio and no ensemble networks

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

XQCfD keeps pretrained policies intact longer via stationary networks and augmented buffers, yielding claimed SOTA sample efficiency on sparse-reward robotic tasks, though the entropy-OOD link needs tighter evidence.

read the letter

The main thing to know is that this paper fixes a common issue in actor-critic RL for robotics: pretrained policies get unlearned too quickly when mixing in online data. By using stationary policy networks along with augmented replay buffers, XQCfD maintains the initial policy better and achieves higher entropy predictions out of distribution, which helps with exploration on hard sparse-reward tasks.

Referee Report

2 major / 2 minor

Summary. The paper proposes XQCfD, an extension of the XQC actor-critic algorithm that incorporates expert demonstration data via augmented replay buffers, pretrained policies, and stationary policy network architectures. The central claim is that this design prevents rapid unlearning of strong initial policies (unlike prior actor-critic methods), enables better out-of-distribution policy improvement through higher-entropy predictions, and achieves state-of-the-art sample-efficient performance on sparse-reward robotic manipulation tasks from the Adroit, Robomimic, and MimicGen benchmarks, notably without ensembles and at low update-to-data ratios.

Significance. If the empirical results hold under rigorous verification, the work could meaningfully advance sample-efficient robotic RL by showing how to better leverage prior data and policies. The emphasis on stationary architectures for preserving entropy in OOD regions offers a practical design insight that may reduce reliance on ensembles or high update frequencies in demonstration-augmented settings.

major comments (2)

[Abstract] Abstract: The central attribution of SOTA gains to the stationary architecture's higher-entropy OOD predictions is load-bearing for the contribution, yet the provided description contains no reference to specific ablation studies, entropy measurements, or controlled comparisons against non-stationary baselines that would isolate this mechanism from the effects of augmented buffers and pretrained policies.
[Abstract] The claim that existing actor-critic designs inherently fail to retain and improve upon strong pretrained policies (Abstract) requires explicit evidence from head-to-head experiments; without reported metrics on policy retention (e.g., performance degradation curves or KL divergence to the initial policy) on the same Adroit/MimicGen tasks, it is difficult to assess whether the stationary design is necessary or merely sufficient.

minor comments (2)

[Abstract] The abstract is written as a single unbroken paragraph with multiple run-on clauses, reducing readability; breaking it into 2-3 sentences would improve clarity.
[Abstract] The acronym XQCfD is introduced without an explicit expansion on first use, which is standard for algorithmic papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the abstract and supporting claims with clearer evidence. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central attribution of SOTA gains to the stationary architecture's higher-entropy OOD predictions is load-bearing for the contribution, yet the provided description contains no reference to specific ablation studies, entropy measurements, or controlled comparisons against non-stationary baselines that would isolate this mechanism from the effects of augmented buffers and pretrained policies.

Authors: We agree the abstract is too concise on this point. The full manuscript includes ablation studies in Section 5.2 that isolate the stationary architecture by comparing variants with and without it (while holding buffers and pretraining fixed), plus entropy measurements in Figure 6 and OOD policy improvement analysis in Section 4.3. We will revise the abstract to explicitly reference these controlled comparisons and measurements. revision: yes
Referee: [Abstract] The claim that existing actor-critic designs inherently fail to retain and improve upon strong pretrained policies (Abstract) requires explicit evidence from head-to-head experiments; without reported metrics on policy retention (e.g., performance degradation curves or KL divergence to the initial policy) on the same Adroit/MimicGen tasks, it is difficult to assess whether the stationary design is necessary or merely sufficient.

Authors: Section 4.1 already reports head-to-head results on Adroit and MimicGen showing performance degradation for non-stationary baselines (SAC, TD3) initialized from the same pretrained policies, contrasted with XQCfD's retention and improvement. However, we did not include explicit KL divergence to the initial policy or full degradation curves. We will add these metrics in the revision to directly support the necessity claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic extension with empirical validation

full rationale

The paper presents XQCfD as an extension of the prior XQC actor-critic algorithm, incorporating augmented replay buffers, pretrained policies, and a stationary network architecture. Claims of improved out-of-distribution policy improvement and SOTA performance on Adroit/Robomimic/MimicGen benchmarks are supported by experimental results rather than any closed-form derivations or predictions. No equations, fitted parameters renamed as predictions, or self-citation chains that reduce the central result to its own inputs appear in the provided abstract or high-level description. The derivation chain is self-contained against external benchmarks and does not exhibit self-definitional, fitted-input, or uniqueness-imported circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the paper introduces no new mathematical axioms or invented entities; it relies on standard RL assumptions such as the existence of useful demonstration data and the ability of replay buffers to mix distributions. No free parameters are explicitly fitted in the provided text.

pith-pipeline@v0.9.0 · 5486 in / 1216 out tokens · 62733 ms · 2026-05-12T04:52:50.016377+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose XQCfD which extends the sample-efficient XQC actor-critic to learn from demonstrations using augmented replay buffers, pre-trained policies and stationary policy architectures... HetStat policies... random Fourier features
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

stationary last layer features... p(z|s) = N(0, σ²) ... random Fourier features φθ,V(s) = fk(V s)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

[1]

Maximum a posteriori policy optimisation

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. InInternational Conference on Learning Representations (ICLR), 2018

work page 2018
[2]

Efficient online reinforcement learning with offline data

Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning (ICML), 2023

work page 2023
[3]

Bellemare, Will Dabney, and Rémi Munos

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforce- ment learning. InInternational Conference on Machine Learning (ICML), 2017

work page 2017
[4]

CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity.International Conference on Learning Representations (ICLR), 2024

Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity.International Conference on Learning Representations (ICLR), 2024

work page 2024
[5]

On-robot reinforcement learning with goal-contrastive rewards

Ondrej Biza, Thomas Weng, Lingfeng Sun, Karl Schmeckpeper, Tarik Kelestemur, Yecheng Ja- son Ma, Robert Platt, Jan-Willem van de Meent, and Lawson LS Wong. On-robot reinforcement learning with goal-contrastive rewards. InIEEE International Conference on Robotics and Automation (ICRA), 2025

work page 2025
[6]

Randomized Ensembled Dou- ble Q-Learning: Learning fast without a model

Xinyue Chen, Che Wang, Zijian Zhou, and Keith W Ross. Randomized Ensembled Dou- ble Q-Learning: Learning fast without a model. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[7]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research (IJRR), 2025

work page 2025
[8]

Asymptotic evaluation of certain Markov process expectations for large time.Communications on Pure and Applied Mathematics, 1983

Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time.Communications on Pure and Applied Mathematics, 1983

work page 1983
[9]

An investigation into neural net opti- mization via Hessian eigenvalue density

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net opti- mization via Hessian eigenvalue density. InInternational Conference on Machine Learning (ICML), 2019

work page 2019
[10]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational Conference on Machine Learning (ICML), 2018. 10

work page 2018
[11]

TD-MPC2: Scalable, robust world models for continuous control

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[12]

Imitation bootstrapped reinforcement learning

Hengyuan Hu, Suvir Mirchandani, and Dorsa Sadigh. Imitation bootstrapped reinforcement learning. InRobotics: Science and Systems (RSS), 2024

work page 2024
[13]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational Conference on Machine Learning (ICML), 2015

work page 2015
[14]

Policy search for motor primitives in robotics.Advances in Neural Information Processing Systems (NeurIPS), 2008

Jens Kober and Jan Peters. Policy search for motor primitives in robotics.Advances in Neural Information Processing Systems (NeurIPS), 2008

work page 2008
[15]

Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble

Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic Q-ensemble. InConference on Robot Learning (CoRL), 2022

work page 2022
[16]

End-to-end training of deep visuomotor policies.Journal of Machine Learning Research (JMLR), 2016

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.Journal of Machine Learning Research (JMLR), 2016

work page 2016
[17]

Normalization and effective learning rates in reinforcement learning

Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[18]

What matters in learning from offline human demonstrations for robot manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021

work page 2021
[19]

Mimicgen: A data generation system for scalable robot learning using human demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning (CoRL), 2023

work page 2023
[20]

Periodic activation functions induce stationarity

Lassi Meronen, Martin Trapp, and Arno Solin. Periodic activation functions induce stationarity. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[21]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AW AC: Accelerating online reinforcement learning with offline datasets.arXiv preprint arXiv:2006.09359, 2020

work page internal anchor Pith review arXiv 2006
[22]

Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning.Advances in Neural Information Processing Systems (NeurIPS), 2023

Mitsuhiko Nakamoto, Simon Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, and Sergey Levine. Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning.Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[23]

What matters for adversarial imitation learning? InAdvances in Neural Information Processing Systems (NeurIPS), 2021

Manu Orsini, Anton Raichuk, Léonard Hussenot, Damien Vincent, Robert Dadashi, Sertan Girgin, Matthieu Geist, Olivier Bachem, Olivier Pietquin, and Marcin Andrychowicz. What matters for adversarial imitation learning? InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[24]

Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

Daniel Palenicek, Florian V ogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization.Advances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[25]

XQC: Well- conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

Daniel Palenicek, Florian V ogt, Joe Watson, Ingmar Posner, and Jan Peters. XQC: Well- conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representations (ICLR), 2026

work page 2026
[26]

Policy gradient methods for robotics

Jan Peters and Stefan Schaal. Policy gradient methods for robotics. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006

work page 2006
[27]

Information theoretic methods in statistics and computer science: Lecture 1 — f-divergences, 2020

Yury Polyanskiy. Information theoretic methods in statistics and computer science: Lecture 1 — f-divergences, 2020

work page 2020
[28]

D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 1991. 11

work page 1991
[29]

Random features for large-scale kernel machines.Advances in Neural Information Processing Systems (NeurIPS), 2007

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines.Advances in Neural Information Processing Systems (NeurIPS), 2007

work page 2007
[30]

Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.Robotics: Science and Systems (RSS), 2018

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.Robotics: Science and Systems (RSS), 2018

work page 2018
[31]

Rasmussen and Chris Williams.Gaussian Processes for Machine Learning

Carl E. Rasmussen and Chris Williams.Gaussian Processes for Machine Learning. MIT Press, 2006

work page 2006
[32]

On stochastic optimal control and reinforcement learning by approximate inference

Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. InRobotics: Science and Systems (RSS), 2013

work page 2013
[33]

On pathologies in KL-regularized reinforcement learning from expert demonstrations.Advances in Neural Information Processing Systems (NeurIPS), 2021

Tim GJ Rudner, Cong Lu, Michael A Osborne, Yarin Gal, and Yee Teh. On pathologies in KL-regularized reinforcement learning from expert demonstrations.Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[34]

How does batch nor- malization help optimization?Advances in Neural Information Processing Systems (NeurIPS), 2018

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch nor- malization help optimization?Advances in Neural Information Processing Systems (NeurIPS), 2018

work page 2018
[35]

Temporal difference learning of position evaluation in the game of Go.Advances in Neural Information Processing Systems (NeurIPS), 1993

Nicol Schraudolph, Peter Dayan, and Terrence J Sejnowski. Temporal difference learning of position evaluation in the game of Go.Advances in Neural Information Processing Systems (NeurIPS), 1993

work page 1993
[36]

Keep doing what worked: Behavior modelling priors for offline reinforcement learning

Noah Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavior modelling priors for offline reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2020

work page 2020
[37]

Mastering the game of Go with deep neural networks and tree search.Nature, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search.Nature, 2016

work page 2016
[38]

Sanjana, and David A

Andrew Stirn, Hans-Hermann Wessels, Megan Schertzer, Laura Pereira, Neville E. Sanjana, and David A. Knowles. Faithful heteroscedastic regression with neural networks. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2023

work page 2023
[39]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998

work page 1998
[40]

L2 regularization versus batch and weight normalization

Twan Van Laarhoven. L2 regularization versus batch and weight normalization.arXiv preprint arXiv:1706.05350, 2017

work page arXiv 2017
[41]

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards.arXiv preprint arXiv:1707.08817, 2017

work page Pith review arXiv 2017
[42]

Neural linear models with functional Gaussian process priors

Joe Watson, Jihao Andreas Lin, Pascal Klink, and Jan Peters. Neural linear models with functional Gaussian process priors. InThird Symposium on Advances in Approximate Bayesian Inference

work page
[43]

Huang, and Nicolas Heess

Joe Watson, Sandy H. Huang, and Nicolas Heess. Coherent soft imitation learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[44]

NX i=1 mi(s,a,s ′) logp θ,i(s,a) # ,(6) where the target projection coefficients are computed as mi(s,a,s ′) = NX j=1

Yi Zhao, Rinu Boney, Alexander Ilin, Juho Kannala, and Joni Pajarinen. Adaptive behavior cloning regularization for stable offline-to-online reinforcement learning.Offline Reinforcement Learning Workshop at Neural Information Processing Systems, 2021. 12 A Ablation: Sensitivity to the Number of Expert Demonstrations Figures 8 to 10 show that XQCfD remains...

work page 2021
[45]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page