Reinforcement and Imitation Learning via Interactive No-Regret Learning
Abstract
Recent work has demonstrated that problems (particularly imitation learning and structured prediction) where a learner's predictions influence the input distribution it is tested on can be naturally addressed by an interactive approach and analyzed using no-regret online learning. These approaches to imitation learning, however, neither require nor benefit from information about the cost of actions. We extend existing results in two directions: first, we develop an interactive imitation learning approach that leverages cost information; second, we extend the technique to address reinforcement learning. The results provide theoretical support for the commonly observed successes of online approximate policy iteration. Our approach suggests a broad new family of algorithms and provides a unifying view of existing techniques for imitation and reinforcement learning.
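The abstract describes the approach only at a high level. As a rough orientation, the sketch below shows how a cost-aware interactive imitation loop in the spirit of this paper's procedure (the cost-sensitive successor to DAgger, known as AggreVaTe) might be organized: roll in with a mixture of expert and learner, sample a switch time, explore one action, record the expert's cost-to-go for it, and retrain on the aggregated data. Every interface used here (env, expert, learner and their methods) is a hypothetical placeholder for illustration, not the paper's notation or a real library API.

```python
import random

def aggrevate_sketch(env, expert, learner, n_iters=10, n_rollouts=20,
                     horizon=50, beta0=1.0):
    """Sketch of an AggreVaTe-style interactive loop (hypothetical interfaces).

    Assumed placeholder interfaces, not the paper's API:
      env.reset() -> initial state;  env.step(action) -> next state
      expert.action(state) -> expert's action
      expert.cost_to_go(state, action, t) -> estimated future cost of taking
        `action` in `state` at time t, then following the expert
      learner.actions -> list of available actions
      learner.action(state) -> current policy's action
      learner.train(dataset) -> fit a cost-sensitive learner on all data so far
    """
    dataset = []  # aggregated (state, action, cost-to-go) examples
    for i in range(n_iters):
        beta = beta0 * (0.5 ** i)  # decaying probability of querying the expert
        for _ in range(n_rollouts):
            t_switch = random.randrange(horizon)  # uniformly sampled switch time
            state = env.reset()
            for t in range(horizon):
                if t < t_switch:
                    # Roll in with a mixture of the expert and the current policy.
                    if random.random() < beta:
                        a = expert.action(state)
                    else:
                        a = learner.action(state)
                    state = env.step(a)
                else:
                    # Explore one action at the switch time and record the
                    # expert's cost-to-go for it.
                    a = random.choice(learner.actions)
                    dataset.append((state, a, expert.cost_to_go(state, a, t)))
                    break
        # Retrain on the aggregate dataset (dataset aggregation as in DAgger).
        learner.train(dataset)
    return learner
```

The decaying mixing rate and the retrain-on-aggregate step stand in for the no-regret online learner the paper analyzes; in principle any no-regret algorithm over the policy class could fill that slot.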
Forward citations
Cited by 5 Pith papers
- Revisiting DAgger in the Era of LLM-Agents
  DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
- Provable imitation learning for control of instability in partially-observed Vlasov-Poisson equations
  Imitation learning yields provably stable feedback policies for partially observed Vlasov-Poisson instabilities, with error floors determined by minimal behavior-cloning loss characterized via entropy of the initial d...
- BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
  The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating a structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
- The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
  SDB balances behavioral diversity and learning stability in VLN self-improvement by expanding decisions into latent hypotheses, performing reliability-aware aggregation, and applying a regularizer, yielding gains such...
- Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
  VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.