vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
hub Mixed citations
Asynchronous Methods for Deep Reinforcement Learning
Mixed citation behavior. Most common role is background (40%).
abstract
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
OpenAI Gym introduces a common interface for reinforcement learning environments and a results-sharing website to enable consistent algorithm comparisons.
SALT is a subspace-adaptive plug-in for GRPO that decomposes group-relative coefficients into shared and residual channels using mini-batch Gram geometry and amplifies residuals to mitigate signed cancellation in RLVR.
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.
Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.
History-conditioned RL policies recover nearly all privileged-state performance for CO2 storage using only well data, while latent model-based retuning outperforms direct model-free adaptation under limited simulator budgets for abnormal scenarios.
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.
A centralized HRL planner with HTAN, multi-stage curricula, and counterfactual baseline scales multi-robot task planning to 200 robots and 1000 racks on unlearned maps in RMFS.
Supervised and reinforcement learning are used to find initial adjoint variables for real-time solution of Hamilton-Jacobi-Bellman equations in two-point boundary value problems.
PROM-Net performs unsupervised visual prediction of robot motion from raw frames and integrates the predictions into model predictive control for navigation in unknown dynamic settings.
Applies established DQN and A3C algorithms to train an agent to play Flappy Bird from raw game images via reinforcement learning.
citing papers explorer
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
-
Planning in entropy-regularized Markov decision processes and games
SmoothCruiser achieves O~(1/epsilon^4) problem-independent sample complexity for value estimation in entropy-regularized MDPs and games via a generative model.
-
OpenAI Gym
OpenAI Gym introduces a common interface for reinforcement learning environments and a results-sharing website to enable consistent algorithm comparisons.
-
SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter
SALT is a subspace-adaptive plug-in for GRPO that decomposes group-relative coefficients into shared and residual channels using mini-batch Gram geometry and amplifies residuals to mitigate signed cancellation in RLVR.
-
Scalable Option Learning in High-Throughput Environments
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
-
RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields
A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
-
Scaling Laws for Transfer
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
-
Delay-Empowered Causal Hierarchical Reinforcement Learning
DECHRL models causal structures and stochastic delay distributions within hierarchical RL and incorporates them into a delay-aware empowerment objective to improve performance under temporal uncertainty.
-
Error whitening: Why Gauss-Newton outperforms Newton
Gauss-Newton descent whitens errors by projecting Newton directions or gradients onto the tangent space, replacing JJ^T with the identity and removing parameterization distortions that affect Newton descent.
-
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
Distill-Belief distills Bayesian information-gain signals from a particle-filter teacher into a compact student policy for fast closed-loop source localization and parameter estimation while avoiding reward hacking.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
DeepMind Control Suite
The DeepMind Control Suite supplies a standardized collection of continuous control tasks with interpretable rewards for benchmarking reinforcement learning agents.
-
Closed-Loop CO2 Storage Control With History-Based Reinforcement Learning and Latent Model-Based Adaptation
History-conditioned RL policies recover nearly all privileged-state performance for CO2 storage using only well data, while latent model-based retuning outperforms direct model-free adaptation under limited simulator budgets for abnormal scenarios.
-
Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
-
Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.
-
Scalable Hierarchical Reinforcement Learning for Hyper Scale Multi-Robot Task Planning
A centralized HRL planner with HTAN, multi-stage curricula, and counterfactual baseline scales multi-robot task planning to 200 robots and 1000 racks on unlearned maps in RMFS.
-
Learning-based Hamilton-Jacobi-Bellman Methods for Optimal Control
Supervised and reinforcement learning are used to find initial adjoint variables for real-time solution of Hamilton-Jacobi-Bellman equations in two-point boundary value problems.
-
Planning Robot Motion using Deep Visual Prediction
PROM-Net performs unsupervised visual prediction of robot motion from raw frames and integrates the predictions into model predictive control for navigation in unknown dynamic settings.
-
Playing Flappy Bird via Asynchronous Advantage Actor Critic Algorithm
Applies established DQN and A3C algorithms to train an agent to play Flappy Bird from raw game images via reinforcement learning.