pith. machine review for the scientific record.

arxiv: 2310.17245 · v2 · submitted 2023-10-26 · 💻 cs.LG · cs.AI

Recognition: unknown

CROP: Conservative Reward for Model-based Offline Policy Optimization

Hao Li, Mei-Jiang Gui, Shi-Qi Liu, Shuang-Yi Wang, Shu-Hai Li, Xiao-Hu Zhou, Xiao-Liang Xie, Zeng-Guang Hou, Zhen-Qiu Feng

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI
keywords: offline · reward · conservative · crop · model-based · policy · data · distribution
original abstract

Offline reinforcement learning (RL) aims to optimize a policy using collected data without online interactions. Model-based approaches are particularly appealing for addressing offline RL challenges because of their capability to mitigate the limitations of data coverage through data generation using models. Nonetheless, a prevalent issue in offline RL is overestimation caused by distribution shift. This study proposes a novel model-based offline RL algorithm named Conservative Reward for model-based Offline Policy optimization (CROP). CROP introduces a streamlined objective that concurrently minimizes the estimation error and the rewards of random actions, thereby yielding a robustly conservative reward estimator. Theoretical analysis shows that the designed conservative reward mechanism leads to conservative policy evaluation and mitigates distribution shift. Experiments show that, with this simple modification to reward estimation, CROP conservatively estimates the reward and achieves performance competitive with existing methods. The source code will be available after acceptance.
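The objective described in the abstract is concrete enough to sketch. Below is a minimal PyTorch sketch of a conservative reward loss of the kind described: a squared estimation error on logged transitions plus a penalty on the predicted reward of uniformly sampled random actions. The names `reward_net` and `beta` and the [-1, 1] action range are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

def crop_reward_loss(reward_net: nn.Module,
                     states: torch.Tensor,    # (B, state_dim) from the offline dataset
                     actions: torch.Tensor,   # (B, action_dim) logged actions
                     rewards: torch.Tensor,   # (B, 1) logged rewards
                     beta: float = 1.0) -> torch.Tensor:
    """Sketch of a conservative reward objective: fit logged rewards while
    pushing down predicted rewards for random (out-of-distribution) actions."""
    # Term 1: reward estimation error on logged (s, a, r) transitions.
    pred = reward_net(torch.cat([states, actions], dim=-1))
    estimation_error = ((pred - rewards) ** 2).mean()

    # Term 2: mean predicted reward of uniformly sampled random actions
    # (assumed action range [-1, 1]); minimizing it makes the estimator
    # conservative outside the data distribution.
    random_actions = torch.empty_like(actions).uniform_(-1.0, 1.0)
    random_pred = reward_net(torch.cat([states, random_actions], dim=-1))
    conservative_penalty = random_pred.mean()

    # beta (assumed name) trades off fit to the data against conservatism.
    return estimation_error + beta * conservative_penalty
```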

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    For diagonal-Gaussian frozen actors, PoE with weight α is equivalent to KL adaptation with β = α/(1−α); empirically, composition shows an actor-competence ceiling with a 4/5/3 HELP/FROZEN/HURT split on D4RL and zero succe...