Fast Policy Learning through Imitation and Reinforcement

Byron Boots; Ching-An Cheng; Nolan Wagener; Xinyan Yan

arxiv: 1805.10413 · v1 · pith:RGV6GGTRnew · submitted 2018-05-26 · 💻 cs.LG · stat.ML

Ching-An Cheng, Xinyan Yan, Nolan Wagener, Byron Boots

classification 💻 cs.LG stat.ML

keywords learning policy expert loki algorithms gradient imitation learn

Imitation learning (IL) consists of a set of tools that leverage expert demonstrations to quickly learn policies. However, if the expert is suboptimal, IL can yield policies with inferior performance compared to reinforcement learning (RL). In this paper, we aim to provide an algorithm that combines the best aspects of RL and IL. We accomplish this by formulating several popular RL and IL algorithms in a common mirror descent framework, showing that these algorithms can be viewed as a variation on a single approach. We then propose LOKI, a strategy for policy learning that first performs a small but random number of IL iterations before switching to a policy gradient RL method. We show that if the switching time is properly randomized, LOKI can learn to outperform a suboptimal expert and converge faster than running policy gradient from scratch. Finally, we evaluate the performance of LOKI experimentally in several simulated environments.
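
The two-phase structure described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `loki_schedule`, `run_loki`, the bounds `k_min` and `k_max`, and the `il_step` / `rl_step` callbacks are placeholders. The switching time is drawn uniformly purely for illustration; the abstract only says it must be "properly randomized", so the actual distribution should be taken from the paper itself.

```python
import random
from typing import Callable, List, TypeVar

P = TypeVar("P")  # stand-in type for whatever represents a policy


def loki_schedule(total_iters: int, k_min: int, k_max: int,
                  rng: random.Random) -> List[str]:
    """Label each iteration 'IL' or 'RL' with one random switch point.

    ASSUMPTION: the switch point K is uniform on [k_min, k_max]. This is an
    illustrative choice, not the distribution used in the paper.
    """
    if not (0 <= k_min <= k_max <= total_iters):
        raise ValueError("need 0 <= k_min <= k_max <= total_iters")
    k = rng.randint(k_min, k_max)  # randint includes both endpoints
    return ["IL"] * k + ["RL"] * (total_iters - k)


def run_loki(policy: P,
             il_step: Callable[[P], P],
             rl_step: Callable[[P], P],
             total_iters: int, k_min: int, k_max: int,
             seed: int = 0) -> P:
    """Apply imitation updates for K iterations, then policy-gradient updates.

    il_step and rl_step are hypothetical callbacks that each perform one
    update and return the new policy.
    """
    rng = random.Random(seed)
    for phase in loki_schedule(total_iters, k_min, k_max, rng):
        policy = il_step(policy) if phase == "IL" else rl_step(policy)
    return policy
```

Keeping the schedule separate from the update loop makes the switch point easy to inspect or replace with a different sampling rule without touching either update step.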

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving
cs.RO 2026-03 unverdicted novelty 7.0

PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

Leveraging Experience in Lazy Search
cs.RO 2019-07 unverdicted novelty 6.0

Uses imitation learning from oracles to train an edge-evaluation policy for lazy graph search, outperforming heuristics on 2D and 7D motion planning problems when test instances are similar to training.