R3M: A Universal Visual Representation for Robot Manipulation
Recognition: 2 theorem links
Pith reviewed 2026-05-15 13:21 UTC · model grok-4.3
The pith
Pre-trained visual features from human videos enable more data-efficient robot manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R3M is a universal visual representation pre-trained on diverse human video data from the Ego4D dataset. The pre-training combines time-contrastive learning to capture temporal structure, video-language alignment for semantic understanding, and an L1 penalty to promote sparse and compact features. When frozen and used for downstream robotic policy learning, R3M boosts task success rates by more than 20% over training from scratch and by more than 10% over state-of-the-art representations such as CLIP and MoCo across 12 simulated manipulation tasks. It further enables a real Franka Emika Panda arm to acquire a variety of manipulation skills in a cluttered apartment setting using just 20 demonstrations.
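For concreteness, a minimal sketch of how the three pre-training objectives could be combined is shown below, in PyTorch-style Python. The encoder outputs (`z_early`, `z_mid`, `z_late`), the language-conditioned scoring head (`score_fn`), the InfoNCE-style contrastive formulation, and the loss weights are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_early, z_mid, z_late, temperature=0.1):
    # Frames closer in time should embed closer: treat (early, mid) as the
    # positive pair and (early, late) as the negative for each clip.
    pos = F.cosine_similarity(z_early, z_mid, dim=-1) / temperature   # (B,)
    neg = F.cosine_similarity(z_early, z_late, dim=-1) / temperature  # (B,)
    logits = torch.stack([pos, neg], dim=1)                           # (B, 2)
    targets = torch.zeros(logits.size(0), dtype=torch.long)           # positive is index 0
    return F.cross_entropy(logits, targets)

def video_language_loss(z_start, z_end, lang_emb, score_fn):
    # A score head should rate progress from clip start to clip end higher
    # under the matched caption than under a caption shuffled within the batch.
    matched = score_fn(z_start, z_end, lang_emb)                      # (B,)
    shuffled = score_fn(z_start, z_end, lang_emb.roll(1, dims=0))     # (B,)
    logits = torch.stack([matched, shuffled], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

def r3m_pretraining_loss(z_early, z_mid, z_late, lang_emb, score_fn,
                         w_tcn=1.0, w_lang=1.0, w_l1=1e-5):
    # Weighted sum of time-contrastive, video-language, and L1 sparsity terms;
    # the weights here are placeholders rather than the paper's values.
    l_tcn = time_contrastive_loss(z_early, z_mid, z_late)
    l_lang = video_language_loss(z_early, z_late, lang_emb, score_fn)
    l_sparse = torch.cat([z_early, z_mid, z_late]).abs().mean()       # encourages sparse features
    return w_tcn * l_tcn + w_lang * l_lang + w_l1 * l_sparse
```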
What carries the argument
The R3M visual encoder, obtained by pre-training on human videos with time-contrastive, language-alignment, and sparsity objectives, acting as a frozen perception module for policy learning.
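To make "frozen perception module" concrete, the sketch below shows one plausible way to train a small behavior-cloning head on top of fixed visual features; the encoder interface, head architecture, and regression loss are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoderPolicy(nn.Module):
    """Assumed setup: a pre-trained visual encoder kept frozen, with a small
    trainable MLP head mapping image features to continuous actions."""

    def __init__(self, encoder: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():   # perception module stays frozen
            p.requires_grad_(False)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # no gradients flow into the encoder
            z = self.encoder(image)
        return self.head(z)

def behavior_cloning_step(policy, optimizer, images, expert_actions):
    # One imitation-learning update on a batch of demonstration frames.
    loss = F.mse_loss(policy(images), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the head's parameters sit in the optimizer, the same small demonstration set can be reused to compare features from different pre-training schemes, which is the kind of controlled comparison the simulated benchmarks rely on.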
If this is right
- Pre-trained human video features transfer to robotic vision without adaptation.
- Data efficiency in robot learning improves significantly with such representations.
- Combining contrastive, language, and sparsity losses creates more effective visual features for control.
- Real-world robot deployment becomes viable with small demonstration sets.
Where Pith is reading between the lines
- Scaling up the pre-training dataset could yield even stronger performance gains for a wider range of tasks.
- This method might generalize to other robot embodiments or sensor modalities beyond the tested arm.
- Integrating R3M with proprioception or other modalities could further enhance learning speed.
Load-bearing premise
Visual features learned from human video data will transfer effectively to robotic camera inputs and task distributions without any robot-specific fine-tuning or domain adaptation.
What would settle it
If policies using the frozen R3M encoder achieve no better success rates than random actions or from-scratch baselines across the 12 simulated manipulation tasks, the claimed transfer benefit would be falsified.
Original abstract
We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces R3M, a visual encoder pre-trained on the Ego4D human video dataset via a combination of time-contrastive learning, video-language alignment, and an L1 sparsity penalty. The frozen R3M representation is then used for downstream policy learning. On 12 simulated manipulation tasks, R3M yields >20% higher success rates than training from scratch and >10% gains over CLIP and MoCo. In real-world experiments, a Franka Emika Panda arm learns several manipulation tasks in a cluttered apartment from only 20 demonstrations.
Significance. If the reported transfer holds under controlled conditions, R3M would provide a practical route to data-efficient robot learning by leveraging large-scale human video corpora. The availability of code and pre-trained models strengthens reproducibility and enables direct follow-up work on domain adaptation or fine-tuning.
major comments (2)
- §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.
- §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.
minor comments (2)
- Figure 3 and Table 2: axis labels and legend entries are too small for print; increase the font size and add error bars or statistical significance markers for the 12-task averages.
- §4.1: the exact number of training episodes per simulated task and the precise definition of "success" (e.g., a threshold on final pose error) should be stated explicitly rather than deferred to an appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, acknowledging limitations where appropriate and outlining revisions.
Point-by-point responses
- Referee: §4.2 (real-robot experiments): success rates are reported for R3M with 20 demonstrations, but no matched real-world baselines for CLIP, MoCo, or training from scratch are provided on the same Franka tasks. This omission prevents quantification of the transfer benefit and leaves the domain-shift assumption untested.
  Authors: We agree that matched real-world baselines would allow direct quantification of transfer gains. Real-robot experiments on the Franka are resource-intensive, which constrained our ability to run full comparisons for all methods. Simulation results already show consistent >10% gains for R3M over CLIP/MoCo and >20% over scratch. In revision we will add explicit discussion of this limitation in §4.2, note the practical success with 20 demos, and include any feasible preliminary real-world data points. Revision: partial.
- Referee: §3.2 (pre-training objectives): the time-contrastive and video-language losses are defined on egocentric human video; no analysis or ablation quantifies robustness to the shift to fixed third-person robot camera views, lighting, and motion statistics that appear in the real-robot evaluation.
  Authors: The referee is correct that no dedicated ablation isolates robustness to viewpoint, lighting, and motion shifts. The objectives aim to learn temporally consistent and semantically aligned features that are expected to generalize, and this is supported by the sim-to-real transfer in our results. In the revised manuscript we will expand §3.2 with a discussion of these factors and add supporting visualizations or limited ablations where space allows. Revision: partial.
Not addressed in this revision:
- Full matched real-world baselines for CLIP, MoCo, and training from scratch on the Franka tasks, as these require extensive additional physical robot time and resources beyond the current revision scope.
Circularity Check
No circularity: empirical pre-training objectives are independent of downstream robot-task metrics
Full rationale
The paper pre-trains a visual encoder on Ego4D via time-contrastive loss, video-language alignment, and L1 sparsity, none of which are defined using the 12 simulated manipulation tasks or the real Franka apartment setup. Frozen R3M features are then evaluated on separate policy-learning benchmarks against scratch, CLIP, and MoCo baselines. No equation or result reduces to a fitted parameter taken from the target success rates; no self-citation supplies a uniqueness theorem that forces the architecture; and the reported gains are measured on held-out task distributions. The central empirical chain therefore remains self-contained and externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human video data contains visual features that transfer to robotic manipulation tasks.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation, washburn_uniqueness_aczel — tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Linked passage: "R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo."
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
  VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training
  VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
  SayCan combines an LLM's high-level semantic knowledge with robot skill value functions to select only feasible actions, enabling completion of abstract natural-language instructions on a real mobile manipulator.
- GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
  GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
- HumanNet: Scaling Human-centric Video Learning to One Million Hours
  HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.
- Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
  Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
- GazeVLA: Learning Human Intention for Robotic Manipulation
  GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
- UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
  UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
- WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
  WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
- ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
  ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
- Hierarchical Planning with Latent World Models
  Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
- Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
  MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
- InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
  InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
  GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
  Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
  GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
  Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.
- Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
  A low-cost whole-body teleoperation system enables effective imitation learning for complex bimanual mobile manipulation by co-training on mobile and static demonstration datasets.