super hub · Mixed citations

Continuous control with deep reinforcement learning

Mixed citation behavior. Most common role is background (62%).

146 Pith papers citing it
Background 62% of classified citations
abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies end-to-end: directly from raw pixel inputs.
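
As a rough illustration of the deterministic policy gradient named in the abstract, the sketch below updates an actor by chaining the critic's action gradient through the policy: the update direction is the mean of dQ/da times dmu/dtheta. Everything concrete here is an assumption made for illustration, not the paper's setup: the critic is a fixed toy function Q(s, a) = -(a - w*s)^2 rather than a learned network, the actor is a single scalar weight, and the names `critic_grad_a` and `train_actor` are hypothetical. Replay buffers, target networks and exploration noise are omitted.

```python
import numpy as np


def critic_grad_a(s, a, w):
    """dQ/da for the toy critic Q(s, a) = -(a - w*s)^2 (assumed, not learned)."""
    return -2.0 * (a - w * s)


def train_actor(w=1.5, lr=0.05, steps=500, batch=32, seed=0):
    """Gradient ascent on a linear deterministic actor mu(s) = theta * s."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(steps):
        s = rng.uniform(-1.0, 1.0, size=batch)  # sampled states
        a = theta * s                           # deterministic actions mu(s)
        dq_da = critic_grad_a(s, a, w)          # critic gradient w.r.t. action
        dmu_dtheta = s                          # actor gradient w.r.t. theta
        # Deterministic policy gradient estimate: mean of dQ/da * dmu/dtheta.
        theta += lr * np.mean(dq_da * dmu_dtheta)
    return theta


theta = train_actor()
print(theta)  # approaches w, the action scale the toy critic prefers
```

With a known critic the actor simply climbs Q; in the actor-critic setting the abstract describes, the critic would itself be learned alongside the actor.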

citation-role summary

background: 10 · method: 6

representative citing papers

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

A training-free survival regression approach uses tabular foundation models to build an accelerated failure time model and iteratively impute right-censored data with a non-parametric in-context estimator, matching the performance of trained Cox and parametric AFT models on benchmarks.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Revisiting Mixture Policies in Entropy-Regularized Actor-Critic

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous control benchmarks.

The Reciprocity Gradient

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The reciprocity gradient allows agents to learn near-optimal context-sensitive policies by analytically propagating reward gradients through reputation chains in multi-agent settings.

Stable GFlowNets with Probabilistic Guarantees

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

Derives loss-to-TV bounds that provide probabilistic guarantees for GFlowNets and introduces the Stable GFlowNets algorithm for improved training stability and distributional fidelity.

Intentional Updates for Streaming Reinforcement Learning

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Intentional TD and Intentional Policy Gradient select step sizes for fixed fractional TD error reduction and bounded policy KL divergence, yielding stable streaming deep RL performance on par with batch methods.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics
    stat.ML · 2026-05-31 · unverdicted · none · ref 19 · internal anchor

    Periodic and soft target updates guarantee convergence in linear Q-learning to the exact projected Q-Bellman solution under spectral and step-size conditions via joint spectral radius analysis of switched linear systems.

  • Variational Sequential Optimal Experimental Design using Reinforcement Learning
    stat.ML · 2023-06-17 · unverdicted · none · ref 59 · internal anchor

    vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.

  • Distributional Off-Policy Evaluation with Deep Quantile Process Regression
    stat.ML · 2026-04-20 · unverdicted · none · ref 115 · internal anchor

    DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.

  • Variational Proximal Policy Optimization
    stat.ML · 2026-06-06 · unverdicted · none · ref 211 · internal anchor

    VP2O maps PPO to SVGD in a MoE architecture using functional kernels and expert orthogonalization, claiming +179 Elo on Codeforces and a 32% token reduction on AIME for a 33B/4B model.