Conservative Q-learning for offline reinforcement learning

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Delightful Distributed Policy Gradient

cs.LG · 2026-03-20 · unverdicted · novelty 6.0

Delightful Policy Gradient gates updates with advantage times surprisal to suppress rare failures while preserving rare successes in distributed RL with stale or buggy data.

Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

cs.LG · 2026-04-29 · unverdicted · novelty 5.0

UARD is a reinforcement learning method that discounts rewards using combined epistemic and aleatoric uncertainty signals via a Reliability Filter, with claimed convergence guarantees and large reductions in reward hacking on benchmarks.

citing papers explorer

  • Delightful Distributed Policy Gradient · cs.LG · 2026-03-20 · unverdicted · none · ref 14

  • Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking · cs.LG · 2026-04-29 · unverdicted · none · ref 11