arxiv: 1604.00923 · v1 · submitted 2016-04-04 · 💻 cs.LG · cs.AI

Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

Emma Brunskill, Philip S. Thomas

keywords policy, data, estimates, estimator, historical, learning, reinforcement, ability

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have orders of magnitude lower mean squared error than existing methods; that is, it makes more efficient use of the available data. Our new estimator is based on two advances: an extension of the doubly robust estimator (Jiang and Li, 2015), and a new way to mix between model-based estimates and importance-sampling-based estimates.
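For readers unfamiliar with the setting, the following is a minimal sketch of ordinary trajectory-wise importance sampling, one of the standard baselines for off-policy evaluation. It is not the estimator proposed in the paper, and it does not include the doubly robust extension or the mixing step described in the abstract. The function name `importance_sampling_estimate`, the trajectory layout of `(state, action, reward)` tuples, and the toy policies `pi_b` and `pi_e` are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Assumed layout: a trajectory is a list of (state, action, reward) tuples.
Trajectory = List[Tuple[int, int, float]]
# Assumed policy interface: policy(state, action) -> probability of that action.
Policy = Callable[[int, int], float]


def importance_sampling_estimate(
    trajectories: List[Trajectory],
    pi_e: Policy,
    pi_b: Policy,
    gamma: float = 1.0,
) -> float:
    """Estimate the expected discounted return of the evaluation policy pi_e
    from trajectories collected under the behavior policy pi_b."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0  # product of per-step likelihood ratios
        ret = 0.0     # discounted return of this trajectory
        for t, (state, action, reward) in enumerate(traj):
            weight *= pi_e(state, action) / pi_b(state, action)
            ret += (gamma ** t) * reward
        total += weight * ret
    return total / len(trajectories)


# Toy example (hypothetical data): one state, two actions, one-step episodes.
def pi_b(state: int, action: int) -> float:
    return 0.5  # behavior policy picks either action uniformly


def pi_e(state: int, action: int) -> float:
    return 1.0 if action == 1 else 0.0  # evaluation policy always picks action 1


data: List[Trajectory] = [
    [(0, 1, 1.0)],  # action 1 earned reward 1
    [(0, 0, 0.0)],  # action 0 earned reward 0
]

estimate = importance_sampling_estimate(data, pi_e, pi_b)
print(estimate)
```

In this toy data, the trajectory that agrees with `pi_e` is up-weighted by a factor of 2 and the other receives weight 0. Products of such ratios can become very large on long trajectories, which is the usual source of high variance in importance sampling estimates.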

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.