First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
super hub Mixed citations
Continuous control with deep reinforcement learning
Mixed citation behavior. Most common role is background (62%).
abstract
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algo
authors
co-cited works
representative citing papers
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.
A reference-decoupled reformulation makes direct data-driven LQT equivalent to certainty-equivalence solutions and supports convergent offline and online DeePO algorithms.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.
The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.
Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.
SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.
Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.
Frictional Q-Learning encodes supported actions as tangent directions on an action manifold using a contrastive variational autoencoder to reduce extrapolation errors in off-policy reinforcement learning.
Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.
Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termination dilemma.
citing papers explorer
-
LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
High-Dimensional Continuous Control Using Generalized Advantage Estimation
Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.
-
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.
-
Direct Data-Driven Linear Quadratic Tracking via Policy Optimization
A reference-decoupled reformulation makes direct data-driven LQT equivalent to certainty-equivalence solutions and supports convergent offline and online DeePO algorithms.
-
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
-
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.
-
The Reciprocity Gradient
The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.
-
Stable GFlowNets with Probabilistic Guarantees
Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
-
A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow
A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.
-
Leveraging Human Feedback for Semantically-Relevant Skill Discovery
SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.
-
Intentional Updates for Streaming Reinforcement Learning
Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
-
Autonomous Diffractometry Enabled by Visual Reinforcement Learning
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
-
To Learn or Not to Learn: A Litmus Test for Using Reinforcement Learning in Control
A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.
-
Frictional Q-Learning
Frictional Q-Learning encodes supported actions as tangent directions on an action manifold using a contrastive variational autoencoder to reduce extrapolation errors in off-policy reinforcement learning.
-
Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?
Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.
-
Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots
Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.
-
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
-
DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
DR-SAC is the first actor-critic distributionally robust RL algorithm for offline continuous control that derives a convergent robust soft policy iteration and reports up to 9.8x higher rewards than SAC under perturbations.
-
Variational Sequential Optimal Experimental Design using Reinforcement Learning
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
-
Dream to Control: Learning Behaviors by Latent Imagination
Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
-
Benchmarking Model-Based Reinforcement Learning
Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termination dilemma.
-
Finding Needles in a Moving Haystack: Prioritizing Alerts with Adversarial Reinforcement Learning
Adversarial RL approximates a game-theoretic equilibrium to yield a stochastic policy for prioritizing alerts against adaptive attackers in fraud and intrusion detection.
-
Exploring Model-based Planning with Policy Networks
POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
-
Soft Actor-Critic Algorithms and Applications
SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
-
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
-
Goal-Conditioned Agents that Learn Everything All at Once
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
-
Optimal design of solar-battery hybrid resources considering multi-market participation under weather and price uncertainty
A deep reinforcement learning co-optimization framework is developed for jointly sizing solar-battery hybrids and determining their multi-market bidding strategies under stochastic weather and price conditions.
-
Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering
Policy-DRIFT combines conditional flow matching with terminal reward guidance and decoupled DRL to achieve 49% drag reduction in Re_tau=180 channel flow, 16% above DRL benchmarks and with 37 times less actuation energy.
-
Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients
The k-step policy gradient converges exponentially close to the optimal deterministic policy in restricted classes, achieving O(1/T) rates under smoothness assumptions without distribution mismatch factors.
-
AdamO: A Collapse-Suppressed Optimizer for Offline RL
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
-
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
-
A Systematic Review and Taxonomy of Reinforcement Learning-Model Predictive Control Integration for Linear Systems
This review synthesizes existing RL-MPC integration methods for linear systems into a taxonomy across RL roles, algorithms, MPC formulations, costs, and domains while identifying recurring patterns and practical challenges.
-
Safe Control using Learned Safety Filters and Adaptive Conformal Inference
ACoFi adaptively tunes the switching threshold of learned safety filters using conformal inference on the range of predicted safety values, asymptotically bounding the rate of incorrect safety assessments by a user parameter and reducing violations versus fixed thresholds in simulations.
-
Scalable Neighborhood-Based Multi-Agent Actor-Critic
MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
-
Distributional Off-Policy Evaluation with Deep Quantile Process Regression
DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
-
Physics-guided surrogate learning enables zero-shot control of turbulent wings
Zero-shot RL control trained on matched channel flows reduces skin-friction drag 28.7% and total drag 10.7% on a NACA4412 wing, outperforming opposition control.
-
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC improves training speed and final performance of off-policy RL on high-dimensional robot tasks by reducing update frequency, increasing model scale, and bounding norms to limit critic error accumulation.
-
LLM-Enhanced Reinforcement Learning for Long-Term User Satisfaction in Interactive Recommendation
LERL combines an LLM high-level planner for diverse semantic categories with a low-level RL policy for item selection to improve long-term user satisfaction and reduce content homogeneity in interactive recommenders.
-
Tensor-Efficient High-Dimensional Q-learning
TEQL uses a low-rank tensor representation of the Q-function plus error-uncertainty guided exploration to achieve better sample efficiency than matrix low-rank or deep RL baselines on classic control tasks under matched parameter budgets.
-
Reinforcement Learning-based Control via Y-wise Affine Neural Networks (YANNs)
YANN-RL initializes RL actor and critic networks with explicit multi-parametric linear MPC solutions via YANNs to start from linear optimal control performance and then learn nonlinear policies through online interaction.
-
Joint Scheduling of Deferrable and Nondeferrable Demand with Colocated Stochastic Supply
Optimal scheduling of deferrable demands with colocated stochastic supply and piecewise-linear pricing reduces to a finite set of three procrastination thresholds per demand class; a reinforcement learning algorithm learns these thresholds when distributions are unknown.
-
Reinforcement Learning with Action Chunking
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
-
SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.
-
Neural Mean-Field Games: Extending Mean-Field Game Theory with Neural Stochastic Differential Equations
Neural mean-field games integrate mean-field game theory with neural SDEs to learn strategic interactions from data in a model-free way, demonstrated on games and viral dynamics.
-
Physically Interpretable World Models via Weakly Supervised Representation Learning
PIWM aligns latent states in image-based world models with physical variables and constrains their dynamics to known equations via weak distribution supervision, yielding accurate long-horizon predictions and parameter recovery on Cart Pole, Lunar Lander, and Donkey Car.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.