Conservative Q-learning for offline reinforcement learning
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Delightful Distributed Policy Gradient
  Delightful Policy Gradient gates updates with advantage times surprisal to suppress rare failures while preserving rare successes in distributed RL with stale or buggy data.
- Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
  UARD is a reinforcement learning method that discounts rewards using combined epistemic and aleatoric uncertainty signals via a Reliability Filter, with claimed convergence guarantees and large reductions in reward hacking on benchmarks.