pith. machine review for the scientific record.

arxiv: 1509.03005 · v1 · submitted 2015-09-10 · 💻 cs.LG · cs.AI · cs.NE · stat.ML

Recognition: unknown

Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.AI · cs.NE · stat.ML
keywords learning · reinforcement · gprop · algorithm · bandit · challenging · compatible · continuous
abstract

This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies with compatible function approximation. The algorithm is based on two innovations. Firstly, we present a temporal-difference based method for learning the gradient of the value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises three neural networks that estimate the value function, its gradient, and determine the actor's policy respectively. We evaluate GProp on two challenging tasks: a contextual bandit problem constructed from nonparametric regression datasets that is designed to probe the ability of reinforcement learning algorithms to accurately estimate gradients; and the octopus arm, a challenging reinforcement learning benchmark. GProp is competitive with fully supervised methods on the bandit task and achieves the best performance to date on the octopus arm.
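The deviator-actor-critic decomposition described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the three networks are reduced to fixed linear maps, and the names `actor`, `critic`, `deviator`, and `q_local` are chosen here for exposition. The structural point it shows is the compatible local model around the actor's action, Q(s, a) ≈ V(s) + ⟨a − μ(s), G(s)⟩, where the deviator G(s) plays the role of the learned gradient of the value with respect to the action.

```python
import random

random.seed(0)
STATE_DIM, ACTION_DIM = 4, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Three function approximators (linear maps for brevity; deep networks in the paper).
W_actor = rand_matrix(ACTION_DIM, STATE_DIM)               # mu(s): the policy
w_critic = [random.gauss(0, 1) for _ in range(STATE_DIM)]  # V(s): value at the actor's action
W_deviator = rand_matrix(ACTION_DIM, STATE_DIM)            # G(s): estimate of grad_a Q(s, a)

def actor(s):
    return matvec(W_actor, s)

def critic(s):
    return dot(w_critic, s)

def deviator(s):
    return matvec(W_deviator, s)

def q_local(s, a):
    """Compatible local model: Q(s, a) ~ V(s) + <a - mu(s), G(s)>."""
    mu = actor(s)
    return critic(s) + dot([ai - mi for ai, mi in zip(a, mu)], deviator(s))
```

Two properties follow directly from this structure: at the actor's own action the model reduces to the critic's value, `q_local(s, actor(s)) == critic(s)`, and the model's gradient with respect to the action is exactly the deviator's output, which is the signal the actor ascends.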

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continuous control with deep reinforcement learning

    cs.LG 2015-09 accept novelty 7.0

    DDPG is a model-free actor-critic algorithm that learns continuous control policies end-to-end from states or pixels using deterministic policy gradients and deep networks, solving more than 20 physics tasks competitively...

  2. AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    cs.LG 2020-06 unverdicted novelty 6.0

    AWAC combines offline data with online RL via advantage-weighted actor-critic updates to enable faster acquisition of robotic skills such as dexterous manipulation.

  3. Soft Deterministic Policy Gradient with Gaussian Smoothing

    cs.LG 2026-05 unverdicted novelty 5.0

    Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discr...