RoboCLIP: One Demonstration is Enough to Learn Robot Policies

Chelsea Finn; Dorsa Sadigh; Erdem Bıyık; Jesse Zhang; Karl Pertsch; Laurent Itti; Sébastien M. R. Arnold; Sumedh A Sontakke

arxiv: 2310.07899 · v1 · pith:APIT64Q6new · submitted 2023-10-11 · 💻 cs.AI · cs.RO

Sumedh A Sontakke, Jesse Zhang, Sébastien M. R. Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, Laurent Itti

keywords: demonstration, learning, reward, roboclip, demonstrations, expert, imitation, design

Reward specification is a notoriously difficult problem in reinforcement learning, requiring extensive expert supervision to design robust reward functions. Imitation learning (IL) methods attempt to circumvent these problems by utilizing expert demonstrations but typically require a large number of in-domain expert demonstrations. Inspired by advances in the field of Video-and-Language Models (VLMs), we present RoboCLIP, an online imitation learning method that uses a single demonstration (overcoming the large data requirement) in the form of a video demonstration or a textual description of the task to generate rewards without manual reward function design. Additionally, RoboCLIP can also utilize out-of-domain demonstrations, like videos of humans solving the task for reward generation, circumventing the need to have the same demonstration and deployment domains. RoboCLIP utilizes pretrained VLMs without any finetuning for reward generation. Reinforcement learning agents trained with RoboCLIP rewards demonstrate 2-3 times higher zero-shot performance than competing imitation learning methods on downstream robot manipulation tasks, doing so using only one video/text demonstration.
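The abstract does not spell out how the reward is computed from the pretrained model. One plausible reading, sketched here in Python under stated assumptions, is that a frozen encoder embeds both the single demonstration and the agent's episode video, and the similarity between the two embeddings is handed to the reinforcement learning agent as a reward at the end of the episode. The names `make_toy_encoder`, `episode_reward`, and `sparse_rewards` are hypothetical, and the encoder is a random projection standing in for a real pretrained VLM; none of this is taken from the authors' code.

```python
import numpy as np

EMBED_DIM = 32  # hypothetical embedding size for the stand-in encoder


def make_toy_encoder(frame_shape, embed_dim=EMBED_DIM, seed=0):
    """Build a stand-in for a frozen, pretrained video encoder.

    ASSUMPTION: a fixed random projection replaces the real VLM so that
    the sketch runs without model weights.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((int(np.prod(frame_shape)), embed_dim))

    def encode(video):
        # video: array of shape (T, H, W, C).
        # Mean-pool the flattened frames over time, project, and L2-normalize.
        pooled = video.reshape(video.shape[0], -1).mean(axis=0)
        z = pooled @ proj
        return z / (np.linalg.norm(z) + 1e-8)

    return encode


def episode_reward(encode, episode_video, demo_embedding):
    """Similarity between the episode embedding and the demonstration embedding."""
    return float(encode(episode_video) @ demo_embedding)


def sparse_rewards(encode, episode_video, demo_embedding):
    """Per-step rewards: zero everywhere except the final step.

    ASSUMPTION: the similarity score is given once, at the end of the episode.
    """
    rewards = np.zeros(len(episode_video))
    rewards[-1] = episode_reward(encode, episode_video, demo_embedding)
    return rewards


# Usage: embed one demonstration once, then score any rollout against it.
rng = np.random.default_rng(1)
encode = make_toy_encoder(frame_shape=(8, 8, 3))
demo_video = rng.random((5, 8, 8, 3))      # the single demonstration
rollout_video = rng.random((7, 8, 8, 3))   # frames rendered from one episode
demo_embedding = encode(demo_video)
rewards = sparse_rewards(encode, rollout_video, demo_embedding)
```

In this sketch the encoder is never updated, which mirrors the abstract's statement that the pretrained model is used without finetuning. A textual task description would follow the same pattern with a text encoder producing `demo_embedding` in the same embedding space.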

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RMTL: Reinforced Micro-task Learning for Long-Horizon Manipulation with VLM Rewards
cs.RO · 2026-06 · unverdicted · novelty 6.0

RMTL decomposes long-horizon Fetch manipulation into three micro-tasks with per-stage VLM rewards, a reverse curriculum, and a learned hierarchical manager, yielding faster learning than single-prompt VLM rewards.

Agent AI: Surveying the Horizons of Multimodal Interaction
cs.AI · 2024-01 · unverdicted · novelty 4.0

The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.