pith. machine review for the scientific record.

arxiv: 2203.12601 · v3 · submitted 2022-03-23 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links · Lean theorems

R3M: A Universal Visual Representation for Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:21 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords robot manipulation · visual representation learning · pre-training · human video · data efficiency · contrastive learning · Ego4D · Franka robot

The pith

Pre-trained visual features from human videos enable more data-efficient robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates using visual representations learned from large-scale human video data to improve learning of robot manipulation tasks. By pre-training on the Ego4D dataset with time-contrastive learning, video-language alignment, and a sparsity penalty, the resulting R3M model serves as a frozen visual module for policy learning. This leads to over 20% higher success rates on 12 simulated tasks compared to training from scratch, and outperforms other pre-trained models like CLIP. In real-world experiments, it allows a robotic arm to learn tasks in a cluttered environment with only 20 demonstrations. A reader would care because it points to a way to leverage abundant human video data to make robot training more practical.

Core claim

R3M is a universal visual representation pre-trained on diverse human video data from the Ego4D dataset. The pre-training combines time-contrastive learning to capture temporal structure, video-language alignment for semantic understanding, and an L1 penalty to promote sparse and compact features. When frozen and used for downstream robotic policy learning, R3M boosts task success rates by more than 20% over training from scratch and by more than 10% over state-of-the-art representations such as CLIP and MoCo across 12 simulated manipulation tasks. It further enables a real Franka Emika Panda arm to acquire a variety of manipulation skills in a cluttered apartment setting using just 20 demonstrations.
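
Schematically, the combined pre-training objective has roughly the following shape. This is an editorial paraphrase, not the paper's exact equation: the similarity function S, the narration symbol ℓ, and the λ weights are assumed notation.

```latex
% Sketch of the R3M pre-training loss for an encoder F_phi with features
% z = F_phi(I): a time-contrastive InfoNCE term over video frames, a
% video-language alignment term, and an L1 sparsity penalty on the features.
% The lambda weights are hyperparameters (assumed notation).
\[
\mathcal{L}(\phi) =
  -\log \frac{e^{S(z_i,\, z_j)}}
             {e^{S(z_i,\, z_j)} + \sum_{k} e^{S(z_i,\, z_k^{-})}}
  \;+\; \lambda_{\mathrm{lang}}\, \mathcal{L}_{\mathrm{lang}}\!\left(z_0,\, z_t,\, \ell\right)
  \;+\; \lambda_{1}\, \lVert z \rVert_{1},
  \qquad z = F_\phi(I)
\]
```

Here z_i and z_j are embeddings of temporally nearby frames from the same video, z_k⁻ are negatives drawn from other times or videos, and the language term scores how well the transition from z_0 to z_t matches the clip's narration ℓ.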

What carries the argument

The R3M visual encoder, obtained by pre-training on human videos with time-contrastive, language-alignment, and sparsity objectives, serves as a frozen perception module for policy learning.
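
A minimal sketch of that design in PyTorch, using a plain torchvision ResNet-50 with random weights as a stand-in for the released R3M encoder; the feature width, 7-DoF proprioception, and action dimension are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Stand-in for the pre-trained R3M encoder (the released model is
# ResNet-based; random weights here are a placeholder, not R3M).
encoder = resnet50(weights=None)
encoder.fc = nn.Identity()               # expose the 2048-d feature vector
encoder.eval()
for p in encoder.parameters():           # frozen perception module
    p.requires_grad = False

FEAT, PROPRIO, ACT = 2048, 7, 7          # proprio/action dims are assumptions

# Small policy head trained by behavior cloning on top of frozen features.
policy = nn.Sequential(
    nn.Linear(FEAT + PROPRIO, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACT),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_step(images, proprio, actions):
    """One behavior-cloning step: MSE between predicted and demo actions."""
    with torch.no_grad():                # no gradient flows into the encoder
        feats = encoder(images)          # (B, 2048)
    pred = policy(torch.cat([feats, proprio], dim=-1))
    loss = nn.functional.mse_loss(pred, actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The choice doing the work is the `requires_grad = False` loop: downstream learning touches only the small policy head, which is what makes learning from 20 demonstrations plausible.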

If this is right

  • Pre-trained human video features transfer to robotic vision without adaptation.
  • Data efficiency in robot learning improves significantly with such representations.
  • Combining contrastive, language, and sparsity losses creates more effective visual features for control.
  • Real-world robot deployment becomes viable with small demonstration sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling up the pre-training dataset could yield even stronger performance gains for a wider range of tasks.
  • This method might generalize to other robot embodiments or sensor modalities beyond the tested arm.
  • Integrating R3M with proprioception or other modalities could further enhance learning speed.

Load-bearing premise

Visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.

What would settle it

If policies using the frozen R3M encoder achieve no better success rates than random actions or from-scratch baselines across the 12 simulated manipulation tasks, the claimed transfer benefit would be falsified.
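
One way to make that criterion operational, sketched with placeholder counts and a one-sided two-proportion test; the paper reports aggregate success rates, so this exact test and the statsmodels dependency are editorial choices.

```python
# Falsification sketch: per-task success counts for R3M-based policies vs. a
# from-scratch baseline. The counts below are placeholders, not paper data.
from statsmodels.stats.proportion import proportions_ztest

def transfer_benefit_holds(r3m_successes, scratch_successes, trials, alpha=0.05):
    """True if R3M's success rate is significantly higher than from-scratch."""
    _, p_value = proportions_ztest(
        count=[r3m_successes, scratch_successes],
        nobs=[trials, trials],
        alternative="larger",            # H1: R3M succeeds more often
    )
    return p_value < alpha

# Hypothetical task: 30/50 successes with R3M vs. 12/50 from scratch.
print(transfer_benefit_holds(30, 12, 50))  # True -> claim survives this task
```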

read the original abstract

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces R3M, a visual encoder pre-trained on the Ego4D human video dataset via a combination of time-contrastive learning, video-language alignment, and an L1 sparsity penalty. The frozen R3M representation is then used for downstream policy learning. On 12 simulated manipulation tasks, R3M yields >20% higher success rates than training from scratch and >10% gains over CLIP and MoCo. In real-world experiments, a Franka Emika Panda arm learns several manipulation tasks in a cluttered apartment from only 20 demonstrations.

Significance. If the reported transfer holds under controlled conditions, R3M would provide a practical route to data-efficient robot learning by leveraging large-scale human video corpora. The availability of code and pre-trained models strengthens reproducibility and enables direct follow-up work on domain adaptation or fine-tuning.

major comments (2)
  1. [§4.2] §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.
  2. [§3.2] §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.
minor comments (2)
  1. [Figure 3] Figure 3 and Table 2: axis labels and legend entries are too small for print; increase font size and add error bars or statistical significance markers for the 12-task averages.
  2. [§4.1] §4.1: the exact number of training episodes per simulated task and the precise definition of 'success' (e.g., threshold on final pose error) should be stated explicitly rather than referenced to an appendix.
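
To make minor comment 2 concrete, a success predicate of the kind the referee asks to see stated explicitly might look like the following; the thresholds are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical success predicate: an episode succeeds if the object's final
# pose is within fixed thresholds of the goal. Thresholds are illustrative.
def episode_success(final_pos, goal_pos, final_yaw, goal_yaw,
                    pos_tol=0.05, yaw_tol=0.2):
    """Success = final position within 5 cm and yaw within 0.2 rad of goal."""
    pos_err = np.linalg.norm(np.asarray(final_pos) - np.asarray(goal_pos))
    yaw_err = abs(final_yaw - goal_yaw)
    return bool(pos_err <= pos_tol and yaw_err <= yaw_tol)
```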

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, acknowledging limitations where appropriate and outlining revisions.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.

    Authors: We agree that matched real-world baselines would allow direct quantification of transfer gains. Real-robot experiments on the Franka are resource-intensive, which constrained our ability to run full comparisons for all methods. Simulation results already show consistent >10% gains for R3M over CLIP/MoCo and >20% over scratch. In revision we will add explicit discussion of this limitation in §4.2, note the practical success with 20 demos, and include any feasible preliminary real-world data points. revision: partial

  2. Referee: [§3.2] §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.

    Authors: The referee is correct that no dedicated ablation isolates robustness to viewpoint, lighting, and motion shifts. The objectives aim to learn temporally consistent and semantically aligned features expected to generalize, and this is supported by sim-to-real transfer in our results. In the revised manuscript we will expand §3.2 with discussion of these factors and add supporting visualizations or limited ablations where space allows. revision: partial
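
A cheap version of the promised robustness probe, as an editorial sketch rather than the authors' ablation: measure how much frozen features drift under synthetic lighting and viewpoint perturbations. The ResNet-50 stand-in and the perturbation choices are assumptions.

```python
import torch
from torchvision import transforms
from torchvision.models import resnet50

encoder = resnet50(weights=None)         # stand-in for frozen R3M weights
encoder.fc = torch.nn.Identity()
encoder.eval()

perturb = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5),  # lighting shift
    transforms.RandomRotation(15),       # crude proxy for viewpoint change
])

@torch.no_grad()
def feature_drift(images):
    """Mean cosine distance between features of clean and perturbed images."""
    z_clean = encoder(images)
    z_pert = encoder(perturb(images))
    cos = torch.nn.functional.cosine_similarity(z_clean, z_pert, dim=-1)
    return (1.0 - cos).mean().item()

# Example: batch of 8 random 224x224 frames (placeholder for robot images).
print(feature_drift(torch.rand(8, 3, 224, 224)))
```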

standing simulated objections not resolved
  • Full matched real-world baselines for CLIP, MoCo, and training from scratch on the Franka tasks, as these require extensive additional physical robot time and resources beyond the current revision scope.

Circularity Check

0 steps flagged

No circularity: empirical pre-training objectives are independent of downstream robot-task metrics

full rationale

The paper pre-trains a visual encoder on Ego4D via time-contrastive loss, video-language alignment, and L1 sparsity, none of which are defined using the 12 simulated manipulation tasks or the real Franka apartment setup. Frozen R3M features are then evaluated on separate policy-learning benchmarks against scratch, CLIP, and MoCo baselines. No equation or result reduces to a fitted parameter taken from the target success rates; no self-citation supplies a uniqueness theorem that forces the architecture; and the reported gains are measured on held-out task distributions. The central empirical chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the standard assumption that contrastive objectives on human video produce features transferable to robotic vision; no new entities or fitted parameters are introduced beyond the usual hyperparameters of contrastive learning.

axioms (1)
  • domain assumption Human video data contains visual features that transfer to robotic manipulation tasks.
    Invoked when the pre-trained encoder is used without adaptation on robot camera streams.

pith-pipeline@v0.9.0 · 5479 in / 1252 out tokens · 50780 ms · 2026-05-15T13:21:26.105969+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  2. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    cs.RO 2023-07 unverdicted novelty 7.0

    VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

  4. VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    cs.RO 2022-09 unverdicted novelty 7.0

    VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

  5. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    cs.RO 2022-04 accept novelty 7.0

    SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.

  6. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  7. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  8. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  9. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  10. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  11. WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...

  12. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  13. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  14. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  15. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  16. GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    cs.RO 2025-03 unverdicted novelty 6.0

    GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

  17. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  18. GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    cs.RO 2024-10 unverdicted novelty 6.0

    GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

  19. Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    cs.RO 2024-09 unverdicted novelty 6.0

    Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.

  20. Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    cs.RO 2024-01 conditional novelty 6.0

    A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

  2. [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

  3. [3] D. Mzurikwao, M. Khan, O. Samuel, J. Cinatl, M. Wass, M. Michaelis, G. Marcelli, and C. S. Ang. Towards image-based cancer cell lines authentication using deep neural networks. Scientific Reports, 10, 2020. doi:10.1038/s41598-020-76670-6.

  4. [4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

  5. [5] Z. Zhang, J. Liu, and N. Razavian. BERT-XML: Large scale automated ICD coding using BERT pretraining. In Proceedings of the 3rd Clinical Natural Language Processing Workshop, pages 24–34, Online, Nov. 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.clinicalnlp-1.3. URL https://aclanthology.org/2020.clinicalnlp-1.3.

  6. [6] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura. BERT representations for video question answering. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1545–1554, 2020. doi:10.1109/WACV45572.2020.9093596.

  7. [7] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. RoboNet: Large-scale multi-robot learning. In Conference on Robot Learning, 2019.

  8. [8] A. Mandlekar, J. Booher, M. Spero, A. Tung, A. Gupta, Y. Zhu, A. Garg, S. Savarese, and L. Fei-Fei. Scaling robot supervision to hundreds of hours with RoboTurk: Robotic manipulation dataset through human reasoning and dexterity. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1048–1055. IEEE, 2019.

  9. [9] S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto. Visual imitation made easy. In CoRL, 2020.

  10. [10] F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. ArXiv, abs/2109.13396, 2021.

  11. [11] T. B. Brown et al. Language models are few-shot learners. arXiv:2005.14165, 2020.

  12. [12] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

  13. [13] P. Goyal, Q. Duval, I. Seessel, M. Caron, I. Misra, L. Sagun, A. Joulin, and P. Bojanowski. Vision models are more robust and fair when pretrained on uncurated images without supervision. ArXiv, abs/2202.08360, 2022.

  14. [14] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.

  15. [15] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In European Conference on Computer Vision (ECCV), 2018.

  16. [16] K. Grauman et al. Ego4D: Around the World in 3,000 Hours of Egocentric Video, 2021.

  17. [17] L. Shao, T. Migimatsu, Q. Zhang, K. Yang, and J. Bohg. Concept2Robot: Learning manipulation concepts from instructions and human demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2020.

  18. [18] A. S. Chen, S. Nair, and C. Finn. Learning generalizable robotic reward functions from "in-the-wild" human videos. ArXiv, abs/2103.16817, 2021.

  19. [19] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In Proceedings of the International Conference on Robotics and Automation (ICRA), 2018.

  20. [20] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. ArXiv, abs/1709.10087, 2018.

  21. [21] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. In CoRL, 2019.

  22. [22] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning, 2020.

  23. [23] S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. K. Gupta. The unsurprising effectiveness of pre-trained vision models for control. 2022.

  24. [24] K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020.

  25. [25] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. ArXiv, abs/2004.14990, 2020.

  26. [26] A. Srinivas, M. Laskin, and P. Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In ICML, 2020.

  27. [27] I. Kostrikov, D. Yarats, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. ArXiv, abs/2004.13649, 2021.

  28. [28] J. Pari, N. M. M. Shafiullah, S. P. Arunachalam, and L. Pinto. The surprising effectiveness of representation learning for visual imitation. ArXiv, abs/2112.01511, 2021.

  29. [29] C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. ArXiv, abs/1906.02736, 2019.

  30. [30] D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. ArXiv, abs/1912.01603, 2020.

  31. [31] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine. Learning invariant representations for reinforcement learning without reconstruction. ArXiv, abs/2006.10742, 2021.

  32. [32] S. Nair, S. Savarese, and C. Finn. Goal-aware prediction: Learning to model what matters. ArXiv, abs/2007.07170, 2020.

  33. [33] M. Hong, K. Lee, M. Kang, W. Jung, and S. Oh. Dynamics-aware metric embedding: Metric learning in a latent space for visual planning. IEEE Robotics and Automation Letters, 2022.

  34. [34] R. Jonschkowski and O. Brock. Learning state representations with robotic priors. Autonomous Robots, 39:407–428, 2015. doi:10.1007/s10514-015-9459-7.

  35. [35] Y.-C. Lin, A. Zeng, S. Song, P. Isola, and T.-Y. Lin. Learning to see before learning to act: Visual pre-training for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7293, 2020.

  36. [36] M. Shridhar, L. Manuelli, and D. Fox. CLIPort: What and where pathways for robotic manipulation. In CoRL, 2021.

  37. [37] A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi. Simple but effective: CLIP embeddings for embodied AI. ArXiv, abs/2111.09888, 2021.

  38. [38] R. Shah and V. Kumar. RRL: ResNet as representation for reinforcement learning. ArXiv, abs/2107.03380, 2021.

  39. [39] Y. Seo, K. Lee, S. James, and P. Abbeel. Reinforcement learning with action-free pre-training from videos. ArXiv, abs/2203.13880, 2022.

  40. [40] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. 2022.

  41. [41] Y. Liu, A. Gupta, P. Abbeel, and S. Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018.

  42. [42] P. Sharma, D. Pathak, and A. Gupta. Third-person visual imitation learning via decoupled hierarchical controller. In NeurIPS, 2019.

  43. [43] L. Smith, N. Dhawan, M. Zhang, P. Abbeel, and S. Levine. AVID: Learning multi-stage tasks via pixel-level translation of human videos. In Proceedings of Robotics: Science and Systems, Corvallis, Oregon, USA, July 2020.

  44. [44] T. Yu, C. Finn, S. Dasari, A. Xie, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018.

  45. [45] K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn. Learning predictive models from observation and interaction. In ECCV, 2020.

  46. [46] A. D. Edwards and C. L. Isbell. Perceptual values from observation. arXiv preprint arXiv:1905.07861, 2019.

  47. [47] K. Schmeckpeper, O. Rybkin, K. Daniilidis, S. Levine, and C. Finn. Reinforcement learning with videos: Combining offline observations with interaction. In CoRL, 2020.

  48. [48] R. Scalise, J. Thomason, Y. Bisk, and S. Srinivasa. Improving robot success detection using static object data. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.

  49. [49] S. Pirk, M. Khansari, Y. Bai, C. Lynch, and P. Sermanet. Online object representations with contrastive learning, 2019.

  50. [50] H. Xiong, Q. Li, Y.-C. Chen, H. Bharadhwaj, S. Sinha, and A. Garg. Learning by watching: Physical imitation of manipulation skills from human videos, 2021.

  51. [51] N. Das, S. Bechtle, T. Davchev, D. Jayaraman, A. Rai, and F. Meier. Model-based inverse reinforcement learning from visual demonstrations, 2021.

  52. [52] K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi. XIRL: Cross-embodiment inverse reinforcement learning, 2021.

  53. [53] S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. B. Amor. Language-conditioned imitation learning for robot manipulation tasks. ArXiv, abs/2010.12083, 2020.

  54. [54] C. Lynch and P. Sermanet. Grounding language in play. ArXiv, abs/2005.07648, 2020.

  55. [55] Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran. Can foundation models perform zero-shot task specification for robot manipulation? In L4DC, 2022.

  56. [56] S. Nair, E. Mitchell, K. Chen, B. Ichter, S. Savarese, and C. Finn. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In CoRL, 2021.

  57. [57] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation (ICRA), 2016.

  58. [58] P. Sharma, L. Mohan, L. Pinto, and A. K. Gupta. Multiple interactions made easy (MIME): Large scale demonstrations data for imitation. In CoRL, 2018.

  59. [59] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In A. Faust, D. Hsu, and G. Neumann, editors, Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 991–1002. PMLR, 2022.

  60. [60] X. Wang and A. K. Gupta. Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2794–2802, 2015.

  61. [61] P. Sermanet, K. Xu, and S. Levine. Unsupervised perceptual rewards for imitation learning. In Proceedings of Robotics: Science and Systems (RSS), 2017.

  62. [62] X. Wang, A. Jabri, and A. A. Efros. Learning correspondence from the cycle-consistency of time. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2561–2571, 2019.

  63. [63] A. Jabri, A. Owens, and A. A. Efros. Space-time correspondence as a contrastive random walk. ArXiv, abs/2006.14613, 2020.

  64. [64] M. Goyal, S. Modi, R. Goyal, and S. Gupta. Human hands as probes for interactive object understanding. In Computer Vision and Pattern Recognition (CVPR), 2022.

  65. [65] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2630–2640, 2019.

  66. [66] H. Xu, G. Ghosh, P.-Y. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. ArXiv, abs/2109.14084, 2021.

  67. [67] A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748, 2018.

  68. [68] S. Ross, G. J. Gordon, and J. A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

  69. [69] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

  70. [70] I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. CoRL, 2022.

  71. [71] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019.