hub Mixed citations
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533
Mixed citation behavior. Most common role is background (60%).
verdicts
UNVERDICTED 11 representative citing papers
citing papers explorer
- On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
Multistability is necessary for temporal horizon generalization in POMDPs; it is sufficient on its own in simple tasks and together with transient dynamics in complex ones, while monostable parallelizable RNNs such as SSMs and gated linear RNNs fail by construction.
- Towards Model-Free Learning in Dynamic Population Games: An Application to Karma Economies
Model-free DQN learning achieves suboptimality bounds of O(1/sqrt(Ns)) + O(1/N) in Karma DPGs at equilibrium, and deep RL combined with fictitious play empirically reaches near-Stationary Nash Equilibrium from scratch.
- Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients
NM-PPG optimizes non-myopic acquisition policies for costly features by enabling pathwise gradients via continuous relaxation and straight-through rollouts in POMDPs, outperforming SOTA baselines.
- Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
Proposes the HAD-MFC framework, which decouples upper-level vulnerable-agent selection from lower-level adversarial policy learning in large-scale MARL using a Fenchel-Rockafellar transform and an MDP reformulation, with provable optimality preservation.
- DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
- Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
Frontier LRMs match human game-learning behavior and predict fMRI signals an order of magnitude better than RL or Bayesian agents because of their in-context game-state representations.
- Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure
A procedure builds provably minimal Markovian states from a longitudinal causal graph, but deep RL requires multi-order historical state exposure (MOSE) to realize gains over minimal or fixed-window baselines.
- Quantile Geometry Regularization for Distributional Reinforcement Learning
RQIQN introduces a Wasserstein DRO-based correction to Bellman quantile targets that enlarges distributional spread without altering risk-neutral averages.
- ANO: A Principled Approach to Robust Policy Optimization
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF experiments.
- On Gaussian approximation for entropy-regularized Q-learning with function approximation
Establishes an n^{-1/4} Gaussian approximation rate in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.
- Soft Deterministic Policy Gradient with Gaussian Smoothing
Soft-DPG uses Gaussian smoothing on the Bellman equation to derive a well-defined policy gradient without relying on critic action derivatives, yielding competitive performance on dense-reward tasks and gains on discretized-reward variants.