VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
Pith reviewed 2026-05-15 04:38 UTC · model grok-4.3
The pith
VIP pre-trains a visual representation on unlabeled human videos that supplies dense rewards for many robot tasks without any fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled data. Theoretically, the objective functions as an implicit time contrastive loss that generates a temporally smooth embedding, so the value function is implicitly defined by embedding distance and can be used to construct the reward for any goal-image specified downstream task.
What carries the argument
Dual goal-conditioned value-function objective, which serves as an implicit time contrastive loss producing temporally smooth embeddings whose distances define dense rewards.
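Concretely, the reward construction described here can be sketched as follows. The encoder below is a toy placeholder standing in for the frozen pre-trained VIP network (the real model is a deep visual backbone); the distance-difference reward follows the paper's definition, where the value is implicitly the negative embedding distance to the goal image:

```python
import numpy as np

# Placeholder encoder standing in for the frozen VIP network: any frozen
# phi(.) mapping an image to an embedding vector fits this interface.
def phi(image: np.ndarray) -> np.ndarray:
    return image.reshape(-1)[:8].astype(np.float64)  # toy projection

def value(obs_emb: np.ndarray, goal_emb: np.ndarray) -> float:
    """Implicit value function: negative L2 distance in embedding space."""
    return -float(np.linalg.norm(obs_emb - goal_emb))

def dense_reward(obs: np.ndarray, next_obs: np.ndarray, goal: np.ndarray) -> float:
    """Per-step reward = change in implicit value toward the goal image."""
    g = phi(goal)
    return value(phi(next_obs), g) - value(phi(obs), g)
```

Summing this reward along a trajectory telescopes to the negative embedding distance at the final observation, which is why a smooth embedding directly yields a dense, shaped reward for any goal-image task.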
If this is right
- The frozen representation supplies dense visual rewards for an extensive set of simulated and real-robot tasks.
- Diverse reward-based visual control methods become viable without task-specific fine-tuning.
- Simple few-shot offline RL succeeds on real-world robot tasks using as few as twenty trajectories.
- The approach significantly outperforms all prior pre-trained representations in providing usable visual rewards.
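A minimal sketch of what few-shot, reward-weighted policy extraction from a handful of trajectories could look like. The synthetic data, linear policy, and temperature are illustrative assumptions, not the paper's pipeline; only the reward definition (decrease in embedding distance to the goal) mirrors VIP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "embeddings" are the observations themselves, and the dense
# reward is the decrease in distance to a goal embedding, as VIP defines it.
goal = np.ones(4)
def reward(o, o2):
    return np.linalg.norm(o - goal) - np.linalg.norm(o2 - goal)

# 20 synthetic trajectories of (obs, action, next_obs); actions drift to goal.
trajs = []
for _ in range(20):
    o = rng.normal(size=4)
    traj = []
    for _ in range(10):
        a = 0.3 * (goal - o) + 0.05 * rng.normal(size=4)
        o2 = o + a
        traj.append((o, a, o2))
        o = o2
    trajs.append(traj)

# Reward-weighted regression: weight each (obs, action) pair by
# exp(reward / temperature) and fit a linear policy by weighted least squares.
O, A, W = [], [], []
for traj in trajs:
    for o, a, o2 in traj:
        O.append(o); A.append(a); W.append(np.exp(reward(o, o2) / 0.5))
O, A, w = np.array(O), np.array(A), np.array(W)
X = np.hstack([O, np.ones((len(O), 1))])   # add bias column
Xw = X * w[:, None]
theta, *_ = np.linalg.lstsq(Xw.T @ X, Xw.T @ A, rcond=None)

def policy(o):
    return np.append(o, 1.0) @ theta
```

The point of the sketch is scale, not algorithmics: with a dense, well-shaped reward, even a weighted regression over 200 transitions recovers goal-reaching behavior.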
Where Pith is reading between the lines
- The action-free objective could allow pre-training on even larger passive video corpora beyond Ego4D.
- If the embeddings capture general visual dynamics, the same rewards might support non-manipulation tasks such as navigation.
- Minimal robot data could be used to fine-tune the embedding for further gains on specific domains while retaining the broad pre-trained prior.
Load-bearing premise
A value function learned solely from unlabeled human videos will produce rewards that remain effective when transferred to robotic embodiments and dynamics without further adaptation.
What would settle it
If control policies trained with VIP-derived rewards fail to solve tasks or perform no better than baselines on real-robot benchmarks, the claim of effective zero-shot transfer would be refuted.
Original abstract
Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Value-Implicit Pre-Training (VIP), which frames visual representation learning from large-scale unlabeled human videos (Ego4D) as an offline goal-conditioned RL problem. It derives an action-free dual goal-conditioned value-function objective that is re-interpreted as an implicit time-contrastive loss, yielding a temporally smooth embedding. Downstream, the frozen embedding distance is used to define dense rewards for arbitrary goal-image tasks. The experiments indicate that this representation, without any in-domain fine-tuning, provides effective rewards for a wide range of simulated and real-robot manipulation tasks, outperforming prior pre-trained representations and enabling few-shot offline RL with as few as 20 trajectories.
Significance. If the central transfer claim holds, the work would be significant for robotics: it offers a concrete path to leverage abundant, diverse human video data for universal, dense visual rewards without task-specific robot data collection. The action-free objective and its contrastive interpretation are technically interesting contributions that could generalize beyond the reported tasks. The real-robot results with frozen representations and minimal trajectories are practically relevant if the domain-gap robustness is convincingly demonstrated.
major comments (2)
- [§4.3] §4.3 and Table 2: the real-robot results with 20 trajectories report high success rates, yet no ablation isolates the effect of the human-to-robot embodiment gap (kinematics, dynamics, camera intrinsics) on reward density or smoothness; without this, it is unclear whether the embedding distance truly encodes task progress invariantly or primarily captures human-specific visual statistics.
- [§3.1] §3.1, Eq. (4)–(6): the derivation of the dual goal-conditioned objective from the offline RL formulation is presented at a high level; the step that removes action dependence and yields an exact implicit value function via embedding distance is load-bearing for the zero-shot reward claim but lacks an expanded proof or explicit assumptions that would allow verification of whether the equivalence holds without additional regularization.
minor comments (2)
- [Figure 3] Figure 3: the reward visualization panels would benefit from explicit scale bars or normalized distance values to allow readers to assess smoothness quantitatively rather than qualitatively.
- [§5.2] §5.2: the comparison tables list prior methods but omit the exact hyper-parameter settings used for each baseline, making it difficult to reproduce the reported performance gaps.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the significance of our work and for the constructive comments. We address each major comment below, providing clarifications and indicating revisions to the manuscript where appropriate.
Point-by-point responses
Referee: [§4.3] §4.3 and Table 2: the real-robot results with 20 trajectories report high success rates, yet no ablation isolates the effect of the human-to-robot embodiment gap (kinematics, dynamics, camera intrinsics) on reward density or smoothness; without this, it is unclear whether the embedding distance truly encodes task progress invariantly or primarily captures human-specific visual statistics.
Authors: We agree that an ablation isolating the human-to-robot embodiment gap would strengthen the claims regarding the invariance of the learned embedding. While the current results demonstrate that the frozen VIP representation, trained only on human videos, successfully provides dense rewards for real-robot tasks with high success rates using as few as 20 trajectories, we acknowledge that this does not explicitly separate visual statistics from task progress. In the revised version, we will include additional analysis, such as reward curves on held-out robot videos and comparisons to baselines that might capture human-specific features, to better isolate these effects. We will also add a discussion in §4.3 on the robustness to embodiment differences. revision: yes
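The held-out analysis the authors propose could be quantified with the bump statistic the paper already reports for embedding distance curves (Table 5): the fraction of steps at which the distance to the goal increases. A minimal sketch:

```python
import numpy as np

def bump_proportion(dist_curve) -> float:
    """Fraction of steps where the embedding distance to the goal increases.

    Lower is smoother: a perfectly monotone distance curve on a video of a
    task solved from start to finish has a bump proportion of 0.
    """
    diffs = np.diff(np.asarray(dist_curve, dtype=float))
    return float(np.mean(diffs > 0))
```

Comparing this statistic between human videos and held-out robot videos would directly measure how much of the reward's smoothness survives the embodiment gap.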
Referee: [§3.1] §3.1, Eq. (4)–(6): the derivation of the dual goal-conditioned objective from the offline RL formulation is presented at a high level; the step that removes action dependence and yields an exact implicit value function via embedding distance is load-bearing for the zero-shot reward claim but lacks an expanded proof or explicit assumptions that would allow verification of whether the equivalence holds without additional regularization.
Authors: Thank you for highlighting the need for a more detailed derivation. The action-free dual objective is obtained by integrating out the actions from the goal-conditioned Bellman equation under the data distribution, resulting in a contrastive loss whose minimizer defines the value function implicitly as the negative embedding distance. To address the concern, we will provide an expanded proof in the appendix of the revised manuscript, including all intermediate steps, the precise assumptions (such as the behavior policy covering the state-goal space and the form of the reward), and verification that no additional regularization is required beyond the derived objective for the equivalence to hold. revision: yes
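For reference, the action-free dual objective at issue, reconstructed here from the garbled derivation fragment in the paper's appendix; the implicit parameterization of the value as a negative embedding distance is the assumption that turns the minimizer into a reward-defining embedding:

```latex
\min_{\phi}\; \mathbb{E}_{p(g)}\Big[(1-\gamma)\,\mathbb{E}_{\mu_0(o;g)}\big[-V(\phi(o);\phi(g))\big]
  \;+\; \log \mathbb{E}_{D(o,o';g)}\Big[\exp\big(\tilde{\delta}_g(o)
  + \gamma V(\phi(o');\phi(g)) - V(\phi(o);\phi(g))\big)\Big]\Big],
\qquad V(\phi(o);\phi(g)) := -\lVert \phi(o) - \phi(g) \rVert_2 .
```

The first term pulls initial-frame embeddings toward goal embeddings; the log-sum-exp term penalizes temporal-difference inconsistency along observed transitions, which is what the referee's requested proof must show holds without extra regularization.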
Circularity Check
VIP derivation chain is self-contained with no circular reductions
full rationale
The paper derives its dual goal-conditioned value objective directly from standard offline goal-conditioned RL (action-free formulation on unlabeled videos), then mathematically reinterprets the learned embedding distance as implicitly defining the value function for downstream reward construction. This is an explicit design choice and equivalence proof, not a fitted parameter renamed as prediction or a self-definitional loop. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatz smuggling are present in the derivation. The human-to-robot transfer claim is an empirical assertion evaluated on real-robot tasks, not a definitional reduction. The overall method remains independent of its inputs beyond the intended contrastive-style pre-training objective.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A value function learned from unlabeled human videos via an action-free objective will produce rewards that transfer to robotic tasks.
Forward citations
Cited by 23 Pith papers
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation · RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation · RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis · KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models · A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models · VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- PriorZero: Bridging Language Priors and World Models for Decision Making · PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
- Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning · Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.
- MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning · MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.
- MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning · MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.
- QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL · QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
- QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL · QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
- GazeVLA: Learning Human Intention for Robotic Manipulation · GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling · UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
- WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations · WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
- Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment · A contrastive alignment model plus offline preference learning explicitly grounds hierarchical VLA language descriptions to actions and visuals on LanguageTable, achieving performance comparable to fully supervised fi...
- ARM: Advantage Reward Modeling for Long-Horizon Manipulation · ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success · OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations · Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation · Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
- Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation · A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
- TD-MPC2: Scalable, Robust World Models for Continuous Control · TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
- Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning · Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data · A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
Reference graph
Works this paper leans on
- [1] Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450.
- [2] Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, et al. Actionable models: Unsupervised offline reinforcement learning of robotic skills. arXiv preprint arXiv:2104.07749.
- [3] Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from "in-the-wild" human videos. arXiv preprint arXiv:2103.16817.
- [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.
- [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [7] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396.
- [8] Benjamin Eysenbach, Tianjun Zhang, Ruslan Salakhutdinov, and Sergey Levine. Contrastive learning as goal-conditioned reinforcement learning. arXiv preprint arXiv:2206.07568.
- [9] Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, and Karol Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956.
- [10] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. JMLR Workshop and Conference Proceedings.
- [11] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [12] Aviral Kumar, Anikait Singh, Stephen Tian, Chelsea Finn, and Sergey Levine. A workflow for offline model-free robotic reinforcement learning. arXiv preprint arXiv:2109.10813.
- [13] Aviral Kumar, Joey Hong, Anikait Singh, and Sergey Levine. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618.
- [14] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- [15] Yecheng Jason Ma, Andrew Shen, Dinesh Jayaraman, and Osbert Bastani. SMODICE: Versatile offline imitation learning via state occupancy matching. arXiv preprint arXiv:2202.02433, 2022a. Yecheng Jason Ma, Jason Yan, Dinesh Jayaraman, and Osbert Bastani. How far I'll go: Offline goal-conditioned reinforcement learning via f-advantage regression. arXiv preprint...
- [16] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298.
- [17] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.
- [18] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE.
- [19] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601.
- [20] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- [21] Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effectiveness of pre-trained vision models for control. arXiv preprint arXiv:2203.03580.
- [22] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2050–2053.
- [23] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177.
- [24] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.
- [25] Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. arXiv preprint arXiv:2011.06507.
- [26] Pierre Sermanet, Kelvin Xu, and Sergey Levine. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699.
- [27] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1134–1141. IEEE.
- [28] Rutav Shah and Vikash Kumar. RRL: ResNet as representation for reinforcement learning. arXiv preprint arXiv:2107.03380.
- [29] Avi Singh, Larry Yang, Kristian Hartikainen, Chelsea Finn, and Sergey Levine. End-to-end robotic reinforcement learning without reward engineering. arXiv preprint arXiv:1904.07854.
- [30] Tongzhou Wang and Phillip Isola. On the learning and learnability of quasimetrics. arXiv preprint arXiv:2206.15478.
- [31] Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173.
- [32] Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7827–7834. IEEE.
- [33] Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, and Sergey Levine. How to leverage unlabeled data in offline reinforcement learning. arXiv preprint arXiv:2202.01741.
discussion (0)