hub

Deep Reinforcement Learning and the Deadly Triad

Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, Joseph Modayil · 2018 · cs.AI · arXiv 1812.02648

19 Pith papers cite this work. Polarity classification is still indexing.

19 Pith papers citing it

open full Pith review browse 19 citing papers arXiv PDF

abstract

We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties, which indicates that there is at least a partial gap in our understanding. In this work, we investigate the impact of the deadly triad in practice, in the context of a family of popular deep reinforcement learning models - deep Q-networks trained with experience replay - analysing how the components of this system play a role in the emergence of the deadly triad, and in the agent's performance

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

Replicable Reinforcement Learning with Linear Function Approximation

cs.LG · 2025-09-10 · unverdicted · novelty 7.0

Introduces replicable random design regression and covariance estimation tools to enable the first provably efficient replicable RL algorithms for linear MDPs in generative and episodic settings.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

cs.LG · 2026-06-03 · conditional · novelty 6.0

Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.

Behavior-Consistent Deep Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.

FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

cs.LG · 2026-04-06 · unverdicted · novelty 6.0 · 2 refs

FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.

Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

cs.LG · 2025-10-02 · unverdicted · novelty 6.0

MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.

K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.

Behavior Regularized Offline Reinforcement Learning

cs.LG · 2019-11-26 · unverdicted · novelty 6.0

Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

cs.LG · 2026-06-18 · unverdicted · novelty 5.0

Extends DAE theory to POMDPs with minimal changes and introduces discrete latent dynamics to cut computational cost, with ALE experiments showing scalability and retained sample efficiency.

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

cs.LG · 2026-06-03 · unverdicted · novelty 5.0

Eligibility traces in deep RL create a peak bias by amplifying distal TD errors into gradient shocks that fixed-step SGD cannot normalize, leading to overestimation of peak-reward trajectories and a mechanistic account of the peak-end rule.

Deep Double Q-learning

cs.LG · 2025-06-30 · unverdicted · novelty 5.0

Deep Double Q-learning explicitly trains two Q-functions in deep RL, outperforming Double DQN on 47 of 57 Atari games while further reducing overestimation.

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

cs.LG · 2026-06-09 · unverdicted · novelty 4.0

Event-driven RL framework for semiconductor manufacturing control shows throughput and utilization gains in high-fidelity simulations under offline and online training.

Plasticity Loss in Deep Reinforcement Learning: A Survey

cs.AI · 2024-11-07 · unverdicted · novelty 4.0

Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

cs.LG · 2020-05-04 · unverdicted · novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

citing papers explorer

Showing 19 of 19 citing papers.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning cs.LG · 2026-06-28 · unverdicted · none · ref 11 · internal anchor
Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.
Replicable Reinforcement Learning with Linear Function Approximation cs.LG · 2025-09-10 · unverdicted · none · ref 6 · internal anchor
Introduces replicable random design regression and covariance estimation tools to enable the first provably efficient replicable RL algorithms for linear MDPs in generative and episodic settings.
Learning Agentic Policy from Action Guidance cs.CL · 2026-05-12 · unverdicted · none · ref 58
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing stat.ML · 2026-05-07 · unverdicted · none · ref 32
Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.
Rollout-Level Advantage-Prioritized Experience Replay for GRPO cs.LG · 2026-06-03 · conditional · none · ref 61 · internal anchor
Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.
Behavior-Consistent Deep Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 99 · 2 links · internal anchor
QED bounds cross-run KL divergence in Boltzmann policies by setting temperature proportional to Q-disagreement and reduces return variance by two orders of magnitude on 18 continuous-control tasks without performance loss.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control cs.LG · 2026-04-06 · unverdicted · none · ref 87 · 2 links · internal anchor
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning cs.LG · 2025-10-02 · unverdicted · none · ref 10 · internal anchor
MINTO sets bootstrapped targets to the minimum of online and target network estimates, yielding faster stable value learning across online/offline RL and discrete/continuous actions.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities cs.AI · 2026-05-07 · unverdicted · none · ref 45 · 2 links
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 34
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 199
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
K-Score: Kalman Filter as a Principled Alternative to Reward Normalization in Reinforcement Learning cs.LG · 2026-04-24 · unverdicted · none · ref 5
A 1D Kalman filter for online reward mean estimation accelerates convergence and lowers variance in policy gradient RL compared to standard normalization on LunarLander and CartPole.
Behavior Regularized Offline Reinforcement Learning cs.LG · 2019-11-26 · unverdicted · none · ref 18
Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning cs.LG · 2026-06-18 · unverdicted · none · ref 63 · internal anchor
Extends DAE theory to POMDPs with minimal changes and introduces discrete latent dynamics to cut computational cost, with ALE experiments showing scalability and retained sample efficiency.
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning cs.LG · 2026-06-03 · unverdicted · none · ref 42 · internal anchor
Eligibility traces in deep RL create a peak bias by amplifying distal TD errors into gradient shocks that fixed-step SGD cannot normalize, leading to overestimation of peak-reward trajectories and a mechanistic account of the peak-end rule.
Deep Double Q-learning cs.LG · 2025-06-30 · unverdicted · none · ref 38 · internal anchor
Deep Double Q-learning explicitly trains two Q-functions in deep RL, outperforming Double DQN on 47 of 57 Atari games while further reducing overestimation.
Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication cs.LG · 2026-06-09 · unverdicted · none · ref 58 · internal anchor
Event-driven RL framework for semiconductor manufacturing control shows throughput and utilization gains in high-fidelity simulations under offline and online training.
Plasticity Loss in Deep Reinforcement Learning: A Survey cs.AI · 2024-11-07 · unverdicted · none · ref 105 · internal anchor
Survey unifies the definition of plasticity loss in DRL, taxonomizes over 50 mitigations, identifies evaluation gaps, and finds general regularization often outperforms domain-specific methods.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems cs.LG · 2020-05-04 · unverdicted · none · ref 55
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Deep Reinforcement Learning and the Deadly Triad

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer