CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

Aditya Bhatt, Max Argus, Artemij Amiranashvili, Thomas Brox · 2019 · arXiv 1902.05605

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · novelty 6.0

AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.

FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

cs.LG · 2026-03-13 · unverdicted · novelty 6.0

FastDSAC enables state-of-the-art maximum entropy RL for high-dimensional humanoid control via entropy redistribution per dimension and improved continuous value estimation.

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

cs.LG · 2019-06-30 · unverdicted · novelty 6.0

Develops Way Off-Policy batch RL algorithms with pre-trained model priors, KL-control, and dropout uncertainty estimates to learn implicit rewards from offline human dialog data, reporting live deployment gains over prior offline methods.

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

cs.LG · 2026-04-22 · unverdicted · novelty 5.0

QDHUAC is a distributional, target-free QD-RL method that enables stable high-UTD training and competitive performance on Brax locomotion tasks using far fewer environment steps than prior approaches.

citing papers explorer

Showing 4 of 4 citing papers.

AdamO: A Collapse-Suppressed Optimizer for Offline RL

cs.LG · 2026-05-03 · unverdicted · none · ref 37

FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

cs.LG · 2026-03-13 · unverdicted · none · ref 20

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

cs.LG · 2019-06-30 · unverdicted · none · ref 4

Distributional Value Estimation Without Target Networks for Robust Quality-Diversity

cs.LG · 2026-04-22 · unverdicted · none · ref 5
