Conservative Q-learning for offline reinforcement learning

Aviral Kumar, Aurick Zhou, George Tucker, Sergey Levine · 2020

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Delightful Distributed Policy Gradient

cs.LG · 2026-03-20 · unverdicted · novelty 6.0

Delightful Policy Gradient gates updates with advantage times surprisal to suppress rare failures while preserving rare successes in distributed RL with stale or buggy data.

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

cs.LG · 2026-04-29 · unverdicted · novelty 5.0

UARD is a reinforcement learning method that discounts rewards using combined epistemic and aleatoric uncertainty signals via a Reliability Filter, with claimed convergence guarantees and large reductions in reward hacking on benchmarks.

citing papers explorer

Delightful Distributed Policy Gradient · cs.LG · 2026-03-20 · unverdicted · none · ref 14
Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking · cs.LG · 2026-04-29 · unverdicted · none · ref 11
