hub Mixed citations

Human-level control through deep reinforcement learning. Nature, 518(7540):529–533

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al · 2015

Mixed citation behavior. Most common role is background (60%).

11 Pith papers citing it

Background 60% of classified citations

citation-role summary

background 3 · baseline 1 · method 1

citation-polarity summary

background 3 · baseline 1 · use method 1

representative citing papers

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs fail by construction.

Towards Model-Free Learning in Dynamic Population Games: An Application to Karma Economies

cs.GT · 2026-05-11 · unverdicted · novelty 7.0

Model-free DQN learning achieves suboptimality bounds of O(1/sqrt(Ns)) + O(1/N) in Karma DPGs at equilibrium, and deep RL combined with fictitious play empirically reaches near-Stationary Nash Equilibrium from scratch.

Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

NM-PPG optimizes non-myopic acquisition policies for costly features by enabling pathwise gradients via continuous relaxation and straight-through rollouts in POMDPs, outperforming SOTA baselines.

Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning

cs.MA · 2025-09-18 · unverdicted · novelty 7.0

Proposes HAD-MFC framework that decouples upper-level vulnerable agent selection from lower-level adversarial policy learning in large-scale MARL using Fenchel-Rockafellar transform and MDP reformulation with provable optimality preservation.

DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty

cs.LG · 2025-06-14 · unverdicted · novelty 7.0

DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

A procedure builds provably minimal Markovian states from a longitudinal causal graph, but deep RL requires multi-order historical state exposure (MOSE) to realize gains over minimal or fixed-window baselines.

Quantile Geometry Regularization for Distributional Reinforcement Learning

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.

ANO: A Principled Approach to Robust Policy Optimization

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.

On Gaussian approximation for entropy-regularized Q-learning with function approximation

stat.ML · 2026-05-17 · unverdicted · novelty 5.0

Establishes n^{-1/4} Gaussian approximation in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.

Soft Deterministic Policy Gradient with Gaussian Smoothing

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discretized-reward variants.

citing papers explorer

Showing 11 of 11 citing papers.

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
cs.LG · 2026-05-12 · unverdicted · none · ref 30

Towards Model-Free Learning in Dynamic Population Games: An Application to Karma Economies
cs.GT · 2026-05-11 · unverdicted · none · ref 26

Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients
cs.LG · 2026-05-06 · unverdicted · none · ref 54

Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
cs.MA · 2025-09-18 · unverdicted · none · ref 68

DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
cs.LG · 2025-06-14 · unverdicted · none · ref 33

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
cs.AI · 2026-05-08 · unverdicted · none · ref 5

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure
cs.LG · 2026-05-08 · unverdicted · none · ref 11

Quantile Geometry Regularization for Distributional Reinforcement Learning
cs.LG · 2026-05-05 · unverdicted · none · ref 21

ANO: A Principled Approach to Robust Policy Optimization
cs.AI · 2026-05-04 · unverdicted · none · ref 18

On Gaussian approximation for entropy-regularized Q-learning with function approximation
stat.ML · 2026-05-17 · unverdicted · none · ref 19

Soft Deterministic Policy Gradient with Gaussian Smoothing
cs.LG · 2026-05-07 · unverdicted · none · ref 14
