Continuous control with deep reinforcement learning
Mixed citation behavior. Most common role is background (62%).
abstract
We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
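As a rough illustration of what "based on the deterministic policy gradient" means, the sketch below applies the chain rule dQ/da · dμ/dw to a scalar linear policy. Everything in it is an assumption made for illustration and does not come from the paper: the critic is a fixed, hand-written quadratic (`critic_q`) rather than a learned network, the policy is `mu(s) = weight * s`, and the names `gain`, `dpg_step`, and the learning rate are hypothetical.

```python
import numpy as np

def critic_q(state, action, gain):
    """Hypothetical fixed critic: Q(s, a) = -(a - gain * s)^2, maximized at a = gain * s."""
    return -(action - gain * state) ** 2

def critic_action_grad(state, action, gain):
    """Analytic dQ/da for the quadratic critic above."""
    return -2.0 * (action - gain * state)

def dpg_step(weight, states, gain, lr):
    """One gradient-ascent step for the deterministic linear policy mu(s) = weight * s."""
    actions = weight * states                     # a = mu(s), no action sampling
    dq_da = critic_action_grad(states, actions, gain)
    dmu_dw = states                               # d mu(s) / d weight
    grad = np.mean(dq_da * dmu_dw)                # chain rule, averaged over the batch
    return weight + lr * grad

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=256)         # illustrative batch of scalar states
weight, gain = 0.0, 1.5
for _ in range(500):
    weight = dpg_step(weight, states, gain, lr=0.5)
print(round(float(weight), 3))
```

In this toy setting the policy weight should move toward `gain`, the action scaling that the critic scores highest. A full actor-critic method would also have to learn the critic from data, which this sketch deliberately leaves out.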
representative citing papers
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
Generalized advantage estimation combined with trust region optimization enables stable neural network policy learning for complex continuous control from raw kinematics.
QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.
Minimum-flow GFlowNets on graphs encode optimal transport plans, with the learned policy recovering the optimal coupling between source and target distributions.
A training-free survival regression approach uses tabular foundation models to build an accelerated failure time model and iteratively impute right-censored data with a non-parametric in-context estimator, matching the performance of trained Cox and parametric AFT models on benchmarks.
Periodic and soft target updates guarantee convergence in linear Q-learning to the exact projected Q-Bellman solution under spectral and step-size conditions via joint spectral radius analysis of switched linear systems.
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
Wasserstein policy gradient converges globally in entropy-regularized RL via Bellman-induced distributional PL geometry and uniform LSI, yielding geometric contraction up to discretization bias.
CurveRL derives a quantile-coordinate reweighting rule from a utility functional on pass rates and shows it outperforms GRPO on reasoning benchmarks.
Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.
A reference-decoupled reformulation makes direct data-driven LQT equivalent to certainty-equivalence solutions and supports convergent offline and online DeePO algorithms.
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.
The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.
Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.
SRSD uses human-provided semantic labels to learn rewards that encourage reinforcement learning agents to discover a wide variety of meaningful and distinct behaviors.
Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.
A model-free reinforcement learning agent learns to align crystals from diffraction images without human supervision or theoretical knowledge.
A litmus test based on reachset-conformant model identification and correlation analysis of uncertainties predicts if RL-based control is superior to model-based control without any RL training.
Action aliasing from safety projections harms policy-gradient estimates more severely when the projection is inside the policy than when it is outside, but a penalty term restores competitiveness.
Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
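One of the summaries above contrasts periodic and soft target updates. The two rules can be sketched as follows, under stated assumptions: the parameter vectors, `tau = 0.05`, and `period = 50` are illustrative values chosen here, and the function names are hypothetical rather than taken from any of the listed papers.

```python
import numpy as np

def soft_update(target, online, tau):
    """Soft (Polyak) update: target <- tau * online + (1 - tau) * target."""
    return tau * online + (1.0 - tau) * target

def periodic_update(target, online, step, period):
    """Periodic update: copy the online parameters every `period` steps, else keep the target."""
    return online.copy() if step % period == 0 else target

online = np.array([1.0, -2.0])      # stand-in for online network parameters
soft_target = np.zeros(2)
hard_target = np.zeros(2)
for step in range(1, 101):
    soft_target = soft_update(soft_target, online, tau=0.05)
    hard_target = periodic_update(hard_target, online, step, period=50)

print(soft_target, hard_target)
```

With fixed online parameters, the soft target closes the gap geometrically by a factor of `1 - tau` per step, while the periodic target jumps to the online values at each multiple of `period`.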
citing papers explorer
- LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller
  First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
- Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots
  Guided RL using Bezier curves and UARM model enables efficient, explainable omnidirectional jumping in quadruped robots.
- Steering Your Diffusion Policy with Latent Space Reinforcement Learning
  DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
- Structured 4D Latent Predictive Model for Robot Planning
  A 4D latent predictive model encodes scenes holistically to generate 3D-consistent futures that an inverse dynamics module converts into robot actions, outperforming video-based planners on manipulation tasks.
- SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning
  SENIOR improves feedback efficiency and policy learning speed in PbRL by combining motion-distinction query selection via kernel density estimation with preference-guided intrinsic rewards, showing gains on simulated and real robot tasks.
- Diffusion Policy Policy Optimization
  DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.
- What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
  A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all datasets and code.
- Environment Probing Interaction Policies
  EPI policies use a transition-predictability reward to probe environments and condition task policies, outperforming standard generalization methods on novel test environments.
- RL-RRT: Kinodynamic Motion Planning via Learning Reachability Estimators from RL Policies
  RL-RRT learns an RL policy for local planning and a reachability estimator to guide RRT expansion, yielding faster kinodynamic planning than prior methods on three robot systems including hardware.
- Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension
  Reinforcement learning produces a single unified controller that lets an actively suspended planetary rover autonomously cross heterogeneous rough terrains after sim training and zero-shot hardware transfer.
- Implicit Action Chunking for Smooth Continuous Control
  Dual-Window Smoothing uses an execution window for deterministic smoothness and a value window to correct critic bias, plus a first-order temporal regularizer, to achieve smoother RL control than explicit chunking or standard baselines.
- Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions
  A three-stage framework pre-trains multi-agent RL agents on real safety-critical data, refines them via online learning in CARLA, and generates the VPSCI dataset of over 198,000 realistic vehicle-pedestrian interaction episodes.
- REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer
  REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.
- E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation
  E²DT couples a Decision Transformer with a k-Determinantal Point Process that scores trajectories on return-to-go quantiles, predictive uncertainty, and stage coverage to improve sample efficiency and policy quality in robotic manipulation.
- Morphology-Aware Graph Reinforcement Learning for Tensegrity Robot Locomotion
  A GNN-augmented SAC policy that encodes tensegrity topology as a graph improves sample efficiency and enables zero-shot sim-to-real locomotion on a 3-bar tensegrity robot.
- Learning to Solve a Rubik's Cube with a Dexterous Hand
  Hierarchical RL combines a model-based cube solver with a model-free hand controller to solve Rubik's cubes in simulation, achieving 90.3% success on 1400 random scrambles.
- Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints
  A multi-agent RL framework for unlabeled multi-robot planning that uses velocity obstacle projections to guarantee collision-free trajectories applicable to arbitrary robot models.
- Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach
  Applies DDPG with a composite reward (attractive destination field, repulsive obstacle fields, control energy penalty) to learn safe paths, claiming faster real-time performance than pseudo-spectral optimal control in simulations.
- RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning
  RDGen uses sim-to-real RL policies to generate smoother robot demonstrations that improve downstream VLA performance over human-collected data on pick-and-place tasks.
- SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving
  SARAD is a hybrid LLM-DRL framework for autonomous driving that replaces random exploration with RAG-enhanced LLM guidance, an attention discriminator, and a collision predictor, reporting performance gains in the Highway-Env simulator.
- Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension
  A temporal-feature extension to clustering, with self-transition suppression for cluster count selection, yields clearer phase structures than prior methods across Ant-v5, HalfCheetah-v5, and Walker2D-v5.
- Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
  SDPG is a new on-policy visual RL algorithm that estimates gradients via stochastic perturbations of rollouts, achieving faster training and lower memory use than baselines on visual MuJoCo tasks while adding new robotics benchmarks and sim-to-real results.
- Prior Policy Guided Dual-Agent Coordinated Manipulation Planning of Spacecraft-Manipulator System
  DACMP applies dual-agent DRL with timestep-level expert switching guidance to achieve simultaneous end-effector precision and base attitude stability in spacecraft-manipulator systems, reporting higher success rates than baselines in simulation.
- On Training Flexible Robots using Deep Reinforcement Learning
  Deep reinforcement learning learns robust policies for flexible robots but is sensitive to sensor choice.
- Zero-shot Transfer of Reinforcement Learning Control Policies for the Swing-Up and Stabilization of a Cart-Pole System
  Zero-shot sim-to-real transfer of independently trained RL policies for cart-pole swing-up and stabilization is achieved via sensitivity-guided domain randomization, linear curriculum learning, and first-order action smoothing with Simulink switching logic.
- Motion Planning in Dynamic Environments: A Survey from Classical to Modern Methods
  Survey of 138 papers (2015-2025) categorizing motion planning in dynamic environments into sampling, graph search, MPC, learning, and classical local methods, plus perception and challenges like prediction uncertainty.
- A Hierarchical Architecture for Sequential Decision-Making in Autonomous Driving using Deep Reinforcement Learning
  A hierarchical DRL architecture generates lane-change commands from occupancy grids for stochastic highway driving and claims improved reliability over end-to-end methods.
- Transfer Learning for Customized Car Racing Environments
  The study applies transfer learning to deep RL in OpenAI car racing, observing that model-based approaches outperform model-free methods and that transfer boosts target domain performance.
- An Introduction to Deep Reinforcement and Imitation Learning
  The paper delivers a concise, self-contained tutorial on foundational DRL algorithms including REINFORCE and PPO and DIL methods including behavioral cloning, DAgger, and GAIL for embodied agents.