Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu · 2016 · arXiv 1602.01783

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

KL for a KL: On-Policy Distillation with Control Variate Baseline

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.

Planning in entropy-regularized Markov decision processes and games

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.

OpenAI Gym

cs.LG · 2016-06-05 · accept · novelty 7.0

OpenAI Gym introduces a common interface for reinforcement learning environments and a results-sharing website to enable consistent algorithm comparisons.

Delay-Empowered Causal Hierarchical Reinforcement Learning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.

Error whitening: Why Gauss-Newton outperforms Newton

cs.LG · 2026-05-11 · conditional · novelty 6.0

Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01 s, achieving 97.8% of DP-optimal utility, and handles larger problems where DP fails.

Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields

cs.AI · 2026-04-28 · unverdicted · novelty 6.0

Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

DeepMind Control Suite

cs.AI · 2018-01-02 · accept · novelty 6.0

The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.

Beyond Distribution Sharpening: The Importance of Task Rewards

cs.LG · 2026-04-17 · unverdicted · novelty 5.0

Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.

Closed-Loop CO2 Storage Control With History-Based Reinforcement Learning and Latent Model-Based Adaptation

cs.LG · 2026-05-04 · unverdicted · novelty 4.0

History-conditioned RL policies recover nearly all privileged-state performance with deployable well data, and latent model-based retuning outperforms direct model-free retuning under abnormal reservoir conditions.

citing papers explorer

KL for a KL: On-Policy Distillation with Control Variate Baseline · cs.LG · 2026-05-08 · unverdicted · none · ref 28
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
Planning in entropy-regularized Markov decision processes and games · cs.LG · 2026-04-21 · unverdicted · none · ref 19
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
OpenAI Gym · cs.LG · 2016-06-05 · accept · none · ref 4
OpenAI Gym introduces a common interface for reinforcement learning environments and a results-sharing website to enable consistent algorithm comparisons.
Delay-Empowered Causal Hierarchical Reinforcement Learning · cs.LG · 2026-05-12 · unverdicted · none · ref 39
DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.
Error whitening: Why Gauss-Newton outperforms Newton · cs.LG · 2026-05-11 · conditional · none · ref 40
Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities · cs.AI · 2026-05-07 · unverdicted · none · ref 28 · 2 links
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management · cs.LG · 2026-05-04 · unverdicted · none · ref 40
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01 s, achieving 97.8% of DP-optimal utility, and handles larger problems where DP fails.
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields · cs.AI · 2026-04-28 · unverdicted · none · ref 50
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 232
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment · cs.CL · 2021-12-01 · conditional · none · ref 155
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
DeepMind Control Suite · cs.AI · 2018-01-02 · accept · none · ref 9
The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.
Beyond Distribution Sharpening: The Importance of Task Rewards · cs.LG · 2026-04-17 · unverdicted · none · ref 25
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
Closed-Loop CO2 Storage Control With History-Based Reinforcement Learning and Latent Model-Based Adaptation · cs.LG · 2026-05-04 · unverdicted · none · ref 70
History-conditioned RL policies recover nearly all privileged-state performance with deployable well data, and latent model-based retuning outperforms direct model-free retuning under abnormal reservoir conditions.
