Sample Efficient Actor-Critic with Experience Replay

Koray Kavukcuoglu; Nando de Freitas; Nicolas Heess; Remi Munos; Victor Bapst; Volodymyr Mnih; Ziyu Wang

arxiv: 1611.01224 · v2 · pith:NVVW6T56new · submitted 2016-11-03 · 💻 cs.LG

classification 💻 cs.LG

keywords actor-critic, efficient, experience, including, replay, sample, several, achieve

This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.
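The abstract names truncated importance sampling with bias correction but does not spell it out. As a minimal sketch of the general idea (not the paper's implementation), an importance weight rho = pi/mu can be split into a part clipped at a constant c and a correction part that is reweighted under the target policy, so that the clipped term has bounded weights while the sum stays unbiased. The names `pi`, `mu`, `f`, `c`, and `truncated_is_terms` are assumptions chosen for illustration, and the small discrete distributions are made up.

```python
import numpy as np

def truncated_is_terms(pi, mu, c):
    """Split rho = pi / mu into a clipped weight and a correction weight.

    Assumes pi and mu are strictly positive probability vectors over the
    same discrete action set (hypothetical setup for illustration).
    """
    rho = pi / mu
    truncated = np.minimum(c, rho)               # used with actions drawn from mu
    correction = np.maximum(0.0, 1.0 - c / rho)  # used with actions drawn from pi
    return rho, truncated, correction

# Illustrative target policy, behaviour policy, and per-action quantity f(a).
pi = np.array([0.7, 0.2, 0.1])
mu = np.array([0.1, 0.3, 0.6])
f = np.array([1.0, -2.0, 0.5])
c = 2.0

rho, truncated, correction = truncated_is_terms(pi, mu, c)

# Exact expectations over the discrete action set.
full = np.sum(mu * rho * f)                    # E_mu[rho * f], equal to E_pi[f]
truncated_part = np.sum(mu * truncated * f)    # E_mu[min(c, rho) * f]
correction_part = np.sum(pi * correction * f)  # E_pi[max(0, 1 - c/rho) * f]

print(full, truncated_part + correction_part)
```

The two printed values agree because min(c, rho) + rho * max(0, 1 - c/rho) = rho for every action; only the first term uses weights that are bounded by c.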

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
math.PR 2026-05 unverdicted novelty 7.0
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
cs.LG 2026-05 accept novelty 7.0
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

Beyond Importance Sampling: Rejection-Gated Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.

Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives
cs.LG 2025-09 conditional novelty 6.0
Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
cs.LG 2019-10 conditional novelty 6.0
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.

Polychromic Objectives for Reinforcement Learning
cs.LG 2025-09 unverdicted novelty 5.0
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.

Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis
q-fin.TR 2019-06 unverdicted novelty 5.0
The authors extend the Almgren-Chriss model to a multi-agent setting and apply deep reinforcement learning to simulate and optimize liquidation strategies under practical constraints.

To Learn or Not to Learn: Analyzing the Role of Learning for Navigation in Virtual Environments
cs.CV 2019-07 unverdicted novelty 4.0
Classical agents outperform learning-based ones on MINOS and Stanford 3D Indoor Spaces, with learned agents weaker at collision avoidance and memory but stronger at handling ambiguity and noise.

A Dual Memory Structure for Efficient Use of Replay Memory in Deep Reinforcement Learning
cs.LG 2019-07 unverdicted novelty 4.0
Dual memory (main plus cache) for replay memory in DRL yields higher scores than single memory across three Gym environments.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Optimal Use of Experience in First Person Shooter Environments
cs.LG 2019-06 unverdicted novelty 2.0
Empirical tests in VizDoom show multiple DQN updates per step do not improve performance after learning rate adjustment, with a 4:1 update-to-step ratio optimal before significant degradation.