Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

Christian Scherer; Daniel Palenicek; Ingmar Posner; Jan Peters; Joe Watson; Theo Gruner

arxiv: 2606.02194 · v1 · pith:5XBO2DTGnew · submitted 2026-06-01 · 💻 cs.LG

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

Christian Scherer , Joe Watson , Theo Gruner , Daniel Palenicek , Ingmar Posner , Jan Peters This is my paper

Pith reviewed 2026-06-28 15:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords inverse reinforcement learningbehavioral cloningrobotic manipulationpolicy finetuninglearned rewardsoff-policy improvementdexterous manipulation

0 comments

The pith

Coherent imitation learning from expert data lets large behavior models improve without the usual RL finetuning drop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether inverse reinforcement learning can learn dense rewards from demonstrations to make finetuning of large behavioral cloning policies more effective than direct RL on sparse rewards. It focuses on coherent imitation learning, which uses a particular reward formulation that comes with theoretical guarantees for improvement. Experiments on six robotic manipulation tasks show the approach maintains or raises performance of the starting policy and reaches at least 90 percent success on five of the six complex cases while beating sparse-reward RL baselines. A reader would care because behavioral cloning scales well for capable policies yet further gains via RL are often inefficient when rewards are sparse; the method offers a route to denser signals that avoid early setbacks.

Core claim

We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a ≥90% success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.

What carries the argument

coherent imitation learning, an IRL method that uses a specific reward formulation with theoretical guarantees to enable improvement of the behavioral cloning policy

Load-bearing premise

Making the initial pretrained finetuning policy optimal for the learned reward and critic circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.

What would settle it

An experiment showing that the IRL-finetuned policy still exhibits an initial performance drop or fails to outperform sparse-reward RL baselines on the six manipulation tasks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.02194 by Christian Scherer, Daniel Palenicek, Ingmar Posner, Jan Peters, Joe Watson, Theo Gruner.

**Figure 1.** Figure 1: An illustration of RL finetuning of strong BC policies using residuals. Despite a strong initial performance by a BC policy, using function approximation in actor-critic methods means this performance is rapidly unlearned and relearned, so in practice, the BC initialization provides little benefit when running the RL method from scratch. We use coherent soft imitation learning (CSIL) for (inverse) RL… view at source ↗

**Figure 2.** Figure 2: Given demonstration data ( ), the coherent reward provides positive rewards for correct actions for observed states, negative rewards for incorrect actions in seen states, and zero rewards for any action under unseen states. This reward encourages the agent to stay (and return) to the demonstration distribution. In contrast to adversarial methods, no on-policy samples are needed to learn it. The contour … view at source ↗

**Figure 3.** Figure 3: A system-level figure of CSIL-based finetuning of VLAs using ensemble actions. The [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of the different action modalities. The cyan, magenta, and blue lines rep [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Performance of CSIL and our improved version CSIL++ on the [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Performance of LBM finetuning on six simulated environments across three seeds. Suc [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: A visualization of 25 rollouts between the VLA and the refined policy on [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning, where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy through using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Coherent IRL finetuning keeps large BC policies from dropping at the start on sparse robotic tasks and beats sparse-reward RL, but the gains look like a solid application of prior theory rather than a big methodological leap.

read the letter

The core claim here is that coherent imitation learning lets you finetune a large pretrained policy on six sparse manipulation tasks without the usual early performance drop, reaching 90%+ success on five of them while beating RL baselines that use the original sparse rewards. They do this by learning a dense reward from demonstrations and ensuring the initial policy is already optimal for that reward plus critic.

What the work actually adds is an empirical demonstration at the scale of large behavior models. The method itself recycles the coherent IRL formulation from earlier papers, so the novelty sits in showing that the optimality trick transfers to pi-0.5-style models and produces usable gains on real manipulation problems. The results are reported cleanly enough in the abstract to suggest the comparison is at least directionally informative.

The soft spots are mostly about scope and verification. Because the abstract is all we have here, we cannot check the exact baseline implementations, the amount of extra interaction data, or whether the learned reward is doing the heavy lifting versus the optimality condition. If the full paper does not include ablations that isolate those two factors, readers will be left wondering how much of the improvement is specific to coherent IRL versus simply having a better-shaped reward. The theoretical guarantees are imported, which is fine, but it means the paper's contribution is more engineering validation than new analysis.

This is aimed at people working on scaling robotic policies from demonstration data who already know the RL finetuning pain points. A reader who cares about sample-efficient improvement of large models will get practical takeaways; someone looking for fresh theory will not. The empirical results on multiple tasks are concrete enough that a serious referee should see it, even if revisions are needed to tighten the comparisons and ablations.

Referee Report

1 major / 1 minor

Summary. The paper proposes using coherent imitation learning, an IRL method with theoretical guarantees, to learn dense rewards from expert demonstrations. This is applied to finetune large behavior-cloned policies (e.g., pi-0.5) for robotic dexterous manipulation tasks. The central claim is that ensuring the initial pretrained policy is optimal for the learned reward and critic avoids the typical initial performance drop in RL finetuning, enabling faster improvement; empirical results show maintained or improved performance on all six sparse manipulation tasks, with ≥90% success on five of six complex tasks, outperforming sparse-reward RL baselines.

Significance. If the results hold with the promised theoretical grounding and empirical details, the work could advance scalable finetuning of large generative models for robotics by providing a sample-efficient alternative to additional human demonstrations or direct sparse-reward RL. The explicit linkage of coherent IRL's reward formulation to avoiding initial drops is a potentially useful contribution if the optimality condition is shown to hold in practice.

major comments (1)

Abstract: The claim that 'ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic' circumvents the initial drop is presented as a key mechanism, but without the specific reward formulation, optimality proof, or empirical verification (e.g., performance curves in the first training steps) in the methods or results, it is not possible to assess whether this assumption holds or is load-bearing for the reported gains over RL baselines.

minor comments (1)

Abstract: Notation such as 'pi-0.5' is used without definition; this should be clarified on first use for readers unfamiliar with the base model.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and address the major comment below. We provide clarifications on the support for the abstract claim while remaining faithful to the manuscript content.

read point-by-point responses

Referee: Abstract: The claim that 'ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic' circumvents the initial drop is presented as a key mechanism, but without the specific reward formulation, optimality proof, or empirical verification (e.g., performance curves in the first training steps) in the methods or results, it is not possible to assess whether this assumption holds or is load-bearing for the reported gains over RL baselines.

Authors: The manuscript introduces coherent imitation learning in Section 3 as the IRL method with a specific reward formulation and theoretical guarantees that make the BC policy optimal for the learned reward and critic; this is the basis for the abstract claim. The results in Section 4 report that performance is maintained or improved on all tasks (with ≥90% success on five of six), which is consistent with avoiding an initial drop relative to the sparse-reward RL baselines. If the linkage requires more explicit cross-referencing or early-step curve insets for clarity, we will incorporate a brief methods paragraph and additional figure details in revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical evaluation of an IRL method (coherent imitation learning) for finetuning pretrained policies on robotic manipulation tasks. The abstract and provided text contain no equations, derivations, or load-bearing steps that reduce predictions or results to inputs by construction, self-definition, or self-citation chains. Claims rest on experimental success rates across six tasks rather than any internal mathematical reduction. The reference to 'theoretical guarantees' is not accompanied by a derivation within the visible content that would trigger circularity patterns. This is a standard empirical result with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is provided, so no free parameters, axioms, or invented entities can be identified from the text.

pith-pipeline@v0.9.1-grok · 5779 in / 1038 out tokens · 31448 ms · 2026-06-28T15:44:39.509799+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Driess, F

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E: An embodied multimodal language model. InInternational Conference on Machine Learning (ICML), 2023

2023
[2]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

R. S. Sutton and A. G. Barto.Reinforcement Learning: An Introduction. MIT press, 2018

2018
[4]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement- residual RL for precise assembly. InIEEE International Conference on Robotics and Automa- tion (ICRA), 2025

2025
[5]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model. InInternational Conference on Learning Representations (ICLR), 2025

2025
[6]

G. Ma, L. Li, H. Wang, Z. Liu, P.-L. Bacon, and D. Tao. What makes value learning efficient in residual reinforcement learning?arXiv preprint arXiv:2602.10539, 2026

work page arXiv 2026
[7]

Eschmann

J. Eschmann. Reward function design in reinforcement learning.Reinforcement learning algorithms: Analysis and Applications, 2021

2021
[8]

W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Y . Xie, F. Hu, J. Wu, Z. Luo, L. Fan, et al. Self- improving vision-language-action models with data generation via residual rl. InInternational Conference on Learning Representation (ICLR), 2026

2026
[9]

Watson, S

J. Watson, S. H. Huang, and N. Heess. Coherent soft imitation learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[10]

Palenicek, F

D. Palenicek, F. V ogt, J. Watson, I. Posner, and J. Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representa- tions (ICLR), 2026

2026
[11]

A. Ren, J. Lidard, L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. InInternational Conference on Learning Representations (ICLR), 2025. 9

2025
[12]

K. Chen, Z. Liu, T. Zhang, Z. Guo, S. Xu, H. Lin, H. Zang, X. Li, Q. Zhang, Z. Yu, G. Fan, T. Huang, Y . Wang, and C. Yu.π RL: Online RL fine-tuning for flow-based vision-language- action models, 2026

2026
[13]

Wagenmaker, Y

A. Wagenmaker, Y . Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. InConfer- ence on Robot Learning (CoRL), 2025

2025
[14]

Residual Policy Learning

T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Ankile, Z

L. Ankile, Z. Jiang, R. Duan, G. Shi, P. Abbeel, and A. Nagabandi. Residual off-policy RL for finetuning behavior cloning policies.arXiv preprint arXiv:2509.19301, 2025

work page arXiv 2025
[16]

Orsini, A

M. Orsini, A. Raichuk, L. Hussenot, D. Vincent, R. Dadashi, S. Girgin, M. Geist, O. Bachem, O. Pietquin, and M. Andrychowicz. What matters for adversarial imitation learning? In Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[17]

Sun and S

Z. Sun and S. Song. From prior to pro: Efficient skill mastery via distribution contractive rl finetuning.arXiv preprint arXiv:2603.10263, 2026

work page internal anchor Pith review arXiv 2026
[18]

arXiv preprint arXiv:2603.26666 , year=

Z. Zhong, H. Yan, J. Li, J. He, T. Zhang, and H. Li. VLA-OPD: Bridging offline sft and online RL for vision-language-action models via on-policy distillation.arXiv preprint arXiv:2603.26666, 2026

work page arXiv 2026
[19]

A. Y . Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InInternational Conference on Machine Learning (ICML), 1999

1999
[20]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational Conference on Machine Learning (ICML), 2018

2018
[21]

Meronen, M

L. Meronen, M. Trapp, and A. Solin. Periodic activation functions induce stationarity. In Advances in Neural Information Processing Systems (NeurIPS), 2021

2021
[22]

Rasmussen and C

C. Rasmussen and C. Williams.Gaussian Processes for Machine Learning. MIT Press, 2006

2006
[23]

Ioffe and C

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational Conference on Machine Learning (ICML), 2015

2015
[24]

C. Lyle, Z. Zheng, K. Khetarpal, J. Martens, H. van Hasselt, R. Pascanu, and W. Dabney. Normalization and effective learning rates in reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[25]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. InNIPS 2016 Deep Learning Symposium, 2016

2016
[26]

Palenicek, F

D. Palenicek, F. V ogt, J. Watson, and J. Peters. Scaling off-policy reinforcement learning with batch and weight normalization. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[27]

Fujimoto, H

S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning (ICML), 2018

2018
[28]

Polyanskiy

Y . Polyanskiy. Information theoretic methods in statistics and computer science: Lecture 1 — f-divergences, 2020. 10

2020
[29]

Espeholt, H

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InInternational Conference on Machine Learning (ICML), 2018

2018
[30]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 2016

2016
[31]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations (ICLR), 2022

2022
[32]

Mandlekar, D

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021

2021
[33]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning (CoRL), 2023

2023
[34]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

2025

[1] [1]

Driess, F

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. PaLM-E: An embodied multimodal language model. InInternational Conference on Machine Learning (ICML), 2023

2023

[2] [2]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.arXiv preprint arXiv:2507.05331, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

R. S. Sutton and A. G. Barto.Reinforcement Learning: An Introduction. MIT press, 2018

2018

[4] [4]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement- residual RL for precise assembly. InIEEE International Conference on Robotics and Automa- tion (ICRA), 2025

2025

[5] [5]

X. Yuan, T. Mu, S. Tao, Y . Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model. InInternational Conference on Learning Representations (ICLR), 2025

2025

[6] [6]

G. Ma, L. Li, H. Wang, Z. Liu, P.-L. Bacon, and D. Tao. What makes value learning efficient in residual reinforcement learning?arXiv preprint arXiv:2602.10539, 2026

work page arXiv 2026

[7] [7]

Eschmann

J. Eschmann. Reward function design in reinforcement learning.Reinforcement learning algorithms: Analysis and Applications, 2021

2021

[8] [8]

W. Xiao, H. Lin, A. Peng, H. Xue, T. He, Y . Xie, F. Hu, J. Wu, Z. Luo, L. Fan, et al. Self- improving vision-language-action models with data generation via residual rl. InInternational Conference on Learning Representation (ICLR), 2026

2026

[9] [9]

Watson, S

J. Watson, S. H. Huang, and N. Heess. Coherent soft imitation learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[10] [10]

Palenicek, F

D. Palenicek, F. V ogt, J. Watson, I. Posner, and J. Peters. XQC: Well-conditioned optimization accelerates deep reinforcement learning.International Conference on Learning Representa- tions (ICLR), 2026

2026

[11] [11]

A. Ren, J. Lidard, L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. InInternational Conference on Learning Representations (ICLR), 2025. 9

2025

[12] [12]

K. Chen, Z. Liu, T. Zhang, Z. Guo, S. Xu, H. Lin, H. Zang, X. Li, Q. Zhang, Z. Yu, G. Fan, T. Huang, Y . Wang, and C. Yu.π RL: Online RL fine-tuning for flow-based vision-language- action models, 2026

2026

[13] [13]

Wagenmaker, Y

A. Wagenmaker, Y . Zhang, M. Nakamoto, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. InConfer- ence on Robot Learning (CoRL), 2025

2025

[14] [14]

Residual Policy Learning

T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[15] [15]

Ankile, Z

L. Ankile, Z. Jiang, R. Duan, G. Shi, P. Abbeel, and A. Nagabandi. Residual off-policy RL for finetuning behavior cloning policies.arXiv preprint arXiv:2509.19301, 2025

work page arXiv 2025

[16] [16]

Orsini, A

M. Orsini, A. Raichuk, L. Hussenot, D. Vincent, R. Dadashi, S. Girgin, M. Geist, O. Bachem, O. Pietquin, and M. Andrychowicz. What matters for adversarial imitation learning? In Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[17] [17]

Sun and S

Z. Sun and S. Song. From prior to pro: Efficient skill mastery via distribution contractive rl finetuning.arXiv preprint arXiv:2603.10263, 2026

work page internal anchor Pith review arXiv 2026

[18] [18]

arXiv preprint arXiv:2603.26666 , year=

Z. Zhong, H. Yan, J. Li, J. He, T. Zhang, and H. Li. VLA-OPD: Bridging offline sft and online RL for vision-language-action models via on-policy distillation.arXiv preprint arXiv:2603.26666, 2026

work page arXiv 2026

[19] [19]

A. Y . Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InInternational Conference on Machine Learning (ICML), 1999

1999

[20] [20]

Haarnoja, A

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational Conference on Machine Learning (ICML), 2018

2018

[21] [21]

Meronen, M

L. Meronen, M. Trapp, and A. Solin. Periodic activation functions induce stationarity. In Advances in Neural Information Processing Systems (NeurIPS), 2021

2021

[22] [22]

Rasmussen and C

C. Rasmussen and C. Williams.Gaussian Processes for Machine Learning. MIT Press, 2006

2006

[23] [23]

Ioffe and C

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational Conference on Machine Learning (ICML), 2015

2015

[24] [24]

C. Lyle, Z. Zheng, K. Khetarpal, J. Martens, H. van Hasselt, R. Pascanu, and W. Dabney. Normalization and effective learning rates in reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[25] [25]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. InNIPS 2016 Deep Learning Symposium, 2016

2016

[26] [26]

Palenicek, F

D. Palenicek, F. V ogt, J. Watson, and J. Peters. Scaling off-policy reinforcement learning with batch and weight normalization. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[27] [27]

Fujimoto, H

S. Fujimoto, H. Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning (ICML), 2018

2018

[28] [28]

Polyanskiy

Y . Polyanskiy. Information theoretic methods in statistics and computer science: Lecture 1 — f-divergences, 2020. 10

2020

[29] [29]

Espeholt, H

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InInternational Conference on Machine Learning (ICML), 2018

2018

[30] [30]

Levine, C

S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 2016

2016

[31] [31]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Rep- resentations (ICLR), 2022

2022

[32] [32]

Mandlekar, D

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Mart´ın-Mart´ın. What matters in learning from offline human demonstrations for robot manipulation. InConference on Robot Learning (CoRL), 2021

2021

[33] [33]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning (CoRL), 2023

2023

[34] [34]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

2025