CrossNorm: Normalization for Off-Policy TD Reinforcement Learning
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 4
verdicts
UNVERDICTED 4
representative citing papers
FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.
Develops Way Off-Policy batch RL algorithms with pre-trained model priors, KL-control, and dropout uncertainty estimates to learn implicit rewards from offline human dialog data, reporting live deployment gains over prior offline methods.
QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.
citing papers explorer
- AdamO: A Collapse-Suppressed Optimizer for Offline RL
  AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
- FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
  FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.
- Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
  Develops Way Off-Policy batch RL algorithms with pre-trained model priors, KL-control, and dropout uncertainty estimates to learn implicit rewards from offline human dialog data, reporting live deployment gains over prior offline methods.
- Distributional Value Estimation Without Target Networks for Robust Quality-Diversity
  QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.