Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
Recognition: 3 theorem links
Pith reviewed 2026-05-14 21:31 UTC · model grok-4.3
The pith
UMI lets robots learn complex manipulation from portable human gripper demonstrations with zero-shot transfer to new settings and hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UMI is a data collection and policy learning framework that enables direct skill transfer from in-the-wild human demonstrations, collected with hand-held grippers, to deployable robot policies. It adds a policy interface with inference-time latency matching and a relative-trajectory action representation, so that learned policies remain hardware-agnostic, deploy across multiple robot platforms, and generalize zero-shot to new environments and objects.
What carries the argument
Hand-held grippers for portable demonstration collection together with a policy interface that performs inference-time latency matching and encodes actions as relative trajectories to close the human-to-robot domain gap.
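To make this concrete, here is a minimal sketch of a relative-trajectory action encoding, assuming end-effector poses stored as 4x4 homogeneous transforms; the function names are illustrative, not the paper's implementation.

```python
import numpy as np

def to_relative_actions(current_pose: np.ndarray,
                        future_poses: np.ndarray) -> np.ndarray:
    """Express a chunk of future end-effector poses (T, 4, 4) in the
    frame of the current pose (4, 4). The result no longer references
    any robot's base frame, which is what makes the action
    representation hardware-agnostic."""
    return np.einsum("ij,tjk->tik", np.linalg.inv(current_pose), future_poses)

def to_absolute_targets(current_pose: np.ndarray,
                        relative_actions: np.ndarray) -> np.ndarray:
    """At deployment, compose the policy's relative actions with the
    executing robot's own current pose to recover absolute targets."""
    return np.einsum("ij,tjk->tik", current_pose, relative_actions)
```

Latency matching is the complementary half: observation and action timestamps are aligned at inference time so that the policy sees, and commands, consistent points in time even though cameras, grippers, and arms each introduce different delays.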
If this is right
- Policies generalize zero-shot to novel environments and objects after training on diverse human demonstrations.
- The same framework supports dynamic, bimanual, precise, and long-horizon behaviors by swapping only the training data.
- Learned policies deploy without modification across multiple robot platforms.
- Data collection becomes portable and low-cost because no robot hardware is required during demonstration gathering.
Where Pith is reading between the lines
- Human-collected data could scale robot training far beyond what robot teleoperation currently allows.
- Testing the same interface on tasks requiring finer finger control would reveal whether relative trajectories remain sufficient.
- The method suggests that careful action representation can matter more for transfer than exact hardware matching.
Load-bearing premise
The gripper interface, latency matching, and relative-trajectory encoding are together sufficient to let policies trained on human data execute reliably on robots despite differences in timing and physical form.
What would settle it
A policy trained on diverse human demonstrations fails to complete the task when deployed on a robot facing a new object or environment that was not seen in training.
Original abstract
We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Universal Manipulation Interface (UMI), a data collection and policy learning framework that uses hand-held grippers for portable in-the-wild human demonstrations of bimanual and dynamic tasks. It incorporates a policy interface with inference-time latency matching and relative-trajectory action representations to enable hardware-agnostic policies that transfer zero-shot to multiple robot platforms, with experiments showing generalization to novel environments and objects by varying only the training data.
Significance. If the zero-shot transfer results hold under rigorous validation, the work would be significant for scalable robot learning: it decouples data collection from robot hardware, enabling low-cost collection of complex manipulation demonstrations and hardware-agnostic deployment. The open-sourcing of the gripper design and software is a concrete strength that supports reproducibility.
major comments (3)
- [Experiments] Experiments section: the central zero-shot generalization claim rests on the assumption that the gripper interface, latency matching, and relative-trajectory representation close the human-robot domain gap, yet no direct quantitative metrics (end-effector trajectory error distributions, residual latency histograms, or kinematic mismatch norms) are reported comparing human demonstrations to robot executions on matched task instances.
- [Experiments] Experiments section: the evaluation does not include ablations that isolate the contribution of each interface component (gripper design, latency matching, relative actions) to the reported success rates, leaving open the possibility that observed performance is driven by task selection or demonstration style rather than the proposed gap-closure mechanisms.
- [Experiments] The manuscript provides limited detail on baseline methods, exact metrics, data collection protocols, and failure-case analysis, which weakens the support for the hardware-agnostic and zero-shot claims despite the plausible experimental outcomes described in the abstract.
minor comments (2)
- Figure captions and axis labels in the results figures could be expanded to include exact success-rate definitions and number of trials per condition for immediate interpretability.
- The related-work section would benefit from explicit comparison to prior teleoperation interfaces that also target domain-gap reduction, to better situate the novelty of the latency-matching and relative-action choices.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to incorporate additional quantitative analysis, partial ablations, and expanded experimental details where feasible.
Point-by-point responses
-
Referee: [Experiments] Experiments section: the central zero-shot generalization claim rests on the assumption that the gripper interface, latency matching, and relative-trajectory representation close the human-robot domain gap, yet no direct quantitative metrics (end-effector trajectory error distributions, residual latency histograms, or kinematic mismatch norms) are reported comparing human demonstrations to robot executions on matched task instances.
Authors: We agree that direct quantitative metrics would strengthen the evidence for domain-gap closure. In the revised manuscript, we have added end-effector trajectory error distributions and residual latency histograms comparing human demonstrations to robot executions on matched task instances in the Experiments section and supplementary material. These metrics confirm that the interface components reduce discrepancies, supporting the zero-shot transfer results (a sketch of how such metrics can be computed follows these responses). revision: yes
-
Referee: [Experiments] Experiments section: the evaluation does not include ablations that isolate the contribution of each interface component (gripper design, latency matching, relative actions) to the reported success rates, leaving open the possibility that observed performance is driven by task selection or demonstration style rather than the proposed gap-closure mechanisms.
Authors: We acknowledge the value of isolating each component's contribution. Full ablations are challenging due to the integrated nature of the UMI framework, particularly for the gripper design which underpins all data collection. In the revision, we have added partial ablations evaluating the effects of latency matching and relative-trajectory representations on success rates for representative tasks, with discussion of why complete isolation of the gripper is not straightforward. revision: partial
-
Referee: [Experiments] The manuscript provides limited detail on baseline methods, exact metrics, data collection protocols, and failure-case analysis, which weakens the support for the hardware-agnostic and zero-shot claims despite the plausible experimental outcomes described in the abstract.
Authors: We have expanded the Experiments section in the revised manuscript to include detailed descriptions of baseline methods (specifying imitation learning approaches and comparisons), exact success metrics (binary task completion with definitions), data collection protocols (demonstration counts, environment and object diversity), and a new failure-case analysis subsection with quantitative breakdowns and qualitative examples. revision: yes
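As context for the first response above, a minimal sketch of how such domain-gap metrics could be computed from time-aligned logs; the array shapes and the compensated-latency placeholder are assumptions, not values from the revision.

```python
import numpy as np

def trajectory_errors(demo_xyz: np.ndarray, robot_xyz: np.ndarray) -> np.ndarray:
    """Per-timestep end-effector position error between a human
    demonstration and the robot execution of the matched task instance.
    Both arrays are (T, 3) and assumed already time-aligned."""
    return np.linalg.norm(demo_xyz - robot_xyz, axis=-1)

def residual_latencies(send_ts: np.ndarray, recv_ts: np.ndarray,
                       compensated_s: float = 0.1) -> np.ndarray:
    """Residual latency samples: observed send-to-receive delay minus the
    delay the interface already compensates for (0.1 s is a placeholder)."""
    return (recv_ts - send_ts) - compensated_s

# The distribution of these errors (and a histogram of residual
# latencies) then summarizes the human-to-robot domain gap directly.
errs = trajectory_errors(np.random.rand(100, 3), np.random.rand(100, 3))
print(f"median trajectory error: {np.median(errs) * 1000:.1f} mm")
```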
Circularity Check
No significant circularity; claims grounded in experiments
full rationale
The manuscript presents UMI as a data-collection and policy-learning framework whose central claims (zero-shot generalization to novel environments/objects, hardware-agnostic deployment) are supported by real-world experiments on diverse human demonstrations. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described structure. Design choices (gripper interface, latency matching, relative actions) are motivated as domain-gap reducers but are not derived from or equivalent to the target results by construction. The chain of support is anchored to external benchmarks through empirical outcomes rather than self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human demonstrations captured via the handheld gripper interface can be mapped to robot actions with minimal unmodeled domain shift.
invented entities (1)
-
Universal Manipulation Interface (UMI) gripper and policy interface (no independent evidence)
Lean theorems connected to this paper
-
Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation.
-
Foundation.LawOfExistence.existence_economically_inevitable (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
Tune to Learn: How Controller Gains Shape Robot Policy Learning
Controller gains affect learnability differently for behavior cloning, RL from scratch, and sim-to-real transfer, so optimal gains depend on the learning paradigm rather than desired task behavior.
-
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models
A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discr...
-
Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
A tactile-aware hierarchical policy for quadrupedal loco-manipulation improves real-world contact-rich task performance by 28.54% over vision-only and visuotactile baselines.
-
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
-
UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception
UMI-3D integrates LiDAR into the UMI hardware for robust multimodal 3D perception in manipulation demonstrations, yielding higher policy success rates and enabling previously infeasible tasks like deformable object handling.
-
XRZero-G0: Pushing the Frontier of Dexterous Robotic Manipulation with Interfaces, Quality and Ratios
XRZero-G0 enables 2000-hour robot-free datasets that, when mixed 10:1 with real-robot data, match full real-robot performance at 1/20th the cost and support zero-shot transfer.
-
WM-DAgger: Enabling Efficient Data Aggregation for Imitation Learning with World Models
WM-DAgger uses world models with corrective action synthesis and consistency-guided filtering to aggregate OOD recovery data for imitation learning, reporting 93.3% success in soft bag pushing with five demonstrations.
-
ActiveGlasses: Learning Manipulation with Active Vision from Ego-centric Human Demonstration
ActiveGlasses learns robot manipulation from ego-centric human demos captured with active vision via smart glasses, achieving zero-shot transfer using object-centric point-cloud policies.
-
EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...
-
TAMEn: Tactile-Aware Manipulation Engine for Closed-Loop Data Collection in Contact-Rich Tasks
TAMEn supplies a cross-morphology wearable interface and pyramid-structured visuo-tactile data regime that raises bimanual manipulation success rates from 34% to 75% via closed-loop collection.
-
RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.
-
Real-Time Execution of Action Chunking Flow Policies
Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.
-
Unified Video Action Model
UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
FlexiTac: A Low-Cost, Open-Source, Scalable Tactile Sensing Solution for Robotic Systems
FlexiTac is a scalable piezoresistive tactile sensing system with flexible FPC-Velostat-FPC pads and a 100 Hz multi-channel readout board that mounts on rigid or soft grippers and supports visuo-tactile learning.
-
Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies
A hierarchical tactile-aware policy combines human-demonstration training for contact cue prediction with sim-to-real reinforcement learning to improve quadrupedal loco-manipulation performance by 28.54% over vision b...
-
OmniUMI: Towards Physically Grounded Robot Learning via Human-Aligned Multimodal Interaction
OmniUMI introduces a multimodal handheld interface that synchronously records RGB, depth, trajectory, tactile, internal grasp force, and external wrench data for training diffusion policies on contact-rich robot manipulation.
-
Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision
Behavior cloning produces active perception in a plant-centering task where a robot arm uses low-resolution egocentric RGB images to predict joint movements, with relative deltas outperforming absolute positions.
-
Towards Robotic Dexterous Hand Intelligence: A Survey
A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
EgoLive is presented as the largest open-source annotated egocentric dataset for real-world task-oriented human routines, captured with a custom head-mounted device and multi-modal annotations exclusively in unconstra...
-
Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...
Reference graph
Works this paper leans on
-
[1]
Human-to-robot imitation in the wild
Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. In Proceedings of Robotics: Science and Systems (RSS) , 2022
-
[2]
Affordances from human videos as a versatile representation for robotics
Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023
-
[3]
Rt-1: Robotics transformer for real-world control at scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. In Proceedings of Robotics: Science and Systems (RSS), 2023
-
[4]
Humanoid robot teleoperation with vibrotactile based balancing feedback
Anais Brygo, Ioannis Sarakoglou, Nadia Garcia-Hernandez, and Nikolaos Tsagarakis. Humanoid robot teleoperation with vibrotactile based balancing feedback. In Haptics: Neuroscience, Devices, Modeling, and Applications: 9th International Conference, EuroHaptics 2014, Versailles, France, June 24-26, 2014, Proceedings, Part II 9, pages 266–275. Springer, 2014
-
[5]
The ycb object and model set: Towards common benchmarks for manipulation research
Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M. Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517, 2015. doi: 10.1109/ICAR.2015.7251504
-
[6]
Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam
Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021
-
[7]
Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam
Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO.2021.3075644
-
[8]
Learning generalizable robotic reward functions from “in-the-wild” human videos
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. In Proceedings of Robotics: Science and Systems (RSS), 2021
-
[9]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023
-
[10]
On hand-held grippers and the morphological gap in human manipulation demonstration
Kiran Doshi, Yijiang Huang, and Stelian Coros. On hand-held grippers and the morphological gap in human manipulation demonstration. arXiv preprint arXiv:2311.01832, 2023
-
[11]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021
-
[12]
Ar2-d2: Training a robot without a robot
Jiafei Duan, Yi Ru Wang, Mohit Shridhar, Dieter Fox, and Ranjay Krishna. Ar2-d2: Training a robot without a robot. 2023
-
[13]
Bridge data: Boosting generalization of robotic skills with cross- domain datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022
-
[14]
Low-cost exoskeletons for learning whole-arm manipulation in the wild
Hongjie Fang, Hao-Shu Fang, Yiming Wang, Jieji Ren, Jingjing Chen, Ruo Zhang, Weiming Wang, and Cewu Lu. Low-cost exoskeletons for learning whole-arm manipulation in the wild. arXiv preprint arXiv:2309.14975, 2023
-
[15]
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024
-
[16]
Automatic generation and detection of highly reliable fiducial markers under occlusion
S. Garrido-Jurado, R. Muñoz-Salinas, F.J. Madrid-Cuevas, and M.J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014
-
[17]
Automatic generation and detection of highly reliable fiducial markers under occlusion
S. Garrido-Jurado, R. Muñoz-Salinas, F.J. Madrid-Cuevas, and M.J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014. ISSN 0031-3203. doi: 10.1016/j.patcog.2014.01.005. URL https://www.sciencedirect.com/science/article/pii/S0031320314000235
-
[18]
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR abs/1512.03385, 2015
-
[19]
Gpmf introuction: Parser for gpmf™ format- ted telemetry data used within gopro® cameras
GoPro Inc. Gpmf introuction: Parser for gpmf™ format- ted telemetry data used within gopro® cameras. https: //gopro.github.io/gpmf-parser/. Accesssed: 2023-01-31
work page 2023
-
[20]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning (CoRL), volume 164, pages 991–1002. PMLR, 2022
-
[21]
Giving robots a hand: Broadening generalization via hand-centric human video demonstrations
Moo Jin Kim, Jiajun Wu, and Chelsea Finn. Giving robots a hand: Broadening generalization via hand-centric human video demonstrations. In Deep Reinforcement Learning Workshop, NeurIPS, 2022
-
[22]
VIP: Towards universal visual reward and representation via value-implicit pre-training
Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, and Amy Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In The Eleventh International Conference on Learning Representations , 2023
-
[23]
Roboturk: A crowdsourcing platform for robotic skill learning through imitation
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning (CoRL) , volume 87, pages 879–893. PMLR, 2018
-
[24]
R3m: A universal visual representation for robot manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL) , volume 205, pages 892–909. PMLR, 2022
-
[25]
Tax-pose: Task-specific cross-pose estimation for robot manipulation
Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, and David Held. Tax-pose: Task-specific cross-pose estimation for robot manipulation. In Proceedings of The 6th Conference on Robot Learning (CoRL) , volume 205, pages 1783–1792. PMLR, 2023
-
[26]
The surprising effectiveness of representation learning for visual imitation
Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pandian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imitation. In Proceedings of Robotics: Science and Systems (RSS), 2022
-
[27]
Learning of compliant human–robot interaction using full-body haptic interface
Luka Peternel and Jan Babič. Learning of compliant human–robot interaction using full-body haptic interface. Advanced Robotics, 27(13):1003–1012, 2013
-
[28]
Characterizing input methods for human-to-robot demonstrations
Pragathi Praveena, Guru Subramani, Bilge Mutlu, and Michael Gleicher. Characterizing input methods for human-to-robot demonstrations. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 344–353. IEEE, 2019
-
[29]
Dexmv: Imitation learning for dexterous manipulation from human videos
Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu, and Xiaolong Wang. Dexmv: Imitation learning for dexterous manipulation from human videos. In European Conference on Computer Vision, pages 570–587. Springer, 2022
-
[30]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
-
[31]
Recent advances in robot learning from demonstration
Harish Ravichandar, Athanasios S Polydoros, Sonia Chernova, and Aude Billard. Recent advances in robot learning from demonstration. Annual review of control, robotics, and autonomous systems , 3:297–330, 2020
-
[32]
Latent plans for task-agnostic offline reinforcement learning
Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1838–1849. PMLR, 2023
-
[33]
Scalable, intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks
Felipe Sanches, Geng Gao, Nathan Elangovan, Ricardo V Godoy, Jayden Chapman, Ke Wang, Patrick Jarvis, and Minas Liarokapis. Scalable, intuitive human to robot skill transfer with wearable human machine interfaces: On complex, dexterous tasks. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6318–6325. IEEE, 2023
-
[34]
Learning predictive models from observation and interaction
Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020
-
[35]
Reinforcement learning with videos: Combining offline observations with interaction
Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, and Chelsea Finn. Reinforcement learning with videos: Combining offline observations with interaction. In Proceedings of the 2020 Conference on Robot Learning (CoRL), volume 155, pages 339–354. PMLR, 2021
-
[36]
Deep imitation learning for humanoid loco-manipulation through human teleoperation
Mingyo Seo, Steve Han, Kyutae Sim, Seung Hyeon Bang, Carlos Gonzalez, Luis Sentis, and Yuke Zhu. Deep imitation learning for humanoid loco-manipulation through human teleoperation. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), pages 1–8. IEEE, 2023
-
[37]
On bringing robots home
Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023
-
[38]
Concept2robot: Learning manipulation concepts from instructions and human demonstrations
Lin Shao, Toki Migimatsu, Qiang Zhang, Karen Yang, and Jeannette Bohg. Concept2robot: Learning manipulation concepts from instructions and human demonstrations. The International Journal of Robotics Research, 40(12-14):1419–1434, 2021
-
[39]
Videodex: Learning dexterity from internet videos
Kenneth Shaw, Shikhar Bahl, and Deepak Pathak. Videodex: Learning dexterity from internet videos. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 654–665. PMLR, 2023
-
[40]
Distilled feature fields enable few-shot language-guided manipulation
William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 405–424. PMLR, 2023
-
[41]
Neural descriptor fields: SE(3)-equivariant object representations for manipulation
Anthony Simeonov, Yilun Du, Andrea Tagliasacchi, Joshua B Tenenbaum, Alberto Rodriguez, Pulkit Agrawal, and Vincent Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pages 6394–6400. IEEE, 2022
-
[42]
Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations
Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. Robotics and Automation Letters, 2020
-
[43]
SEED: Series elastic end effectors in 6d for visuotactile tool use
H.J. Terry Suh, Naveen Kuppuswamy, Tao Pang, Paul Mitiguy, Alex Alspach, and Russ Tedrake. SEED: Series elastic end effectors in 6d for visuotactile tool use. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4684–4691, 2022. doi: 10.1109/IROS47612.2022.9982092
-
[44]
A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics
Alexander Toedtheide, Xiao Chen, Hamid Sadeghian, Abdeldjallil Naceri, and Sami Haddadin. A force-sensitive exoskeleton for teleoperation: An application in elderly care robotics. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 12624–12630. IEEE, 2023
-
[45]
Mimicplay: Long-horizon imitation learning by watching human play
Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play. In Proceedings of The 7th Conference on Robot Learning (CoRL), volume 229, pages 201–221. PMLR, 2023
-
[46]
Error-aware imitation learning from teleoperation data for mobile manipulation
Josiah Wong, Albert Tung, Andrey Kurenkov, Ajay Mandlekar, Li Fei-Fei, Silvio Savarese, and Roberto Martín-Martín. Error-aware imitation learning from teleoperation data for mobile manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), volume 164, pages 1367–1378. PMLR, 2022
-
[47]
GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators
Philipp Wu, Fred Shentu, Xingyu Lin, and Pieter Abbeel. GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023
-
[48]
Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot
Keenan A Wyrobek, Eric H Berger, HF Machiel Van der Loos, and J Kenneth Salisbury. Towards a personal robotics development platform: Rationale and design of an intrinsically safe personal robot. In 2008 IEEE International Conference on Robotics and Automation, pages 2165–2170. IEEE, 2008
-
[49]
Masked visual pre-training for motor control
Tete Xiao, Ilija Radosavovic, Trevor Darrell, and Jitendra Malik. Masked visual pre-training for motor control. arXiv:2203.06173, 2022
-
[50]
Learning by watching: Physical imitation of manipulation skills from human videos
Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021
-
[51]
Visual imitation made easy
Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy. In Conference on Robot Learning (CoRL), volume 155, pages 1992–2005. PMLR, 2021
-
[52]
Deep imitation learning for complex manipulation tasks from virtual reality teleoperation
Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 5628–5635. IEEE, 2018
-
[53]
Benefit of large field-of-view cameras for visual odometry
Zichao Zhang, Henri Rebecq, Christian Forster, and Davide Scaramuzza. Benefit of large field-of-view cameras for visual odometry. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 801–808, 2016. doi: 10.1109/ICRA.2016.7487210
-
[54]
Learning fine-grained bimanual manipulation with low-cost hardware
Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems (RSS) , 2023
-
[55]
Viola: Imitation learning for vision-based manipulation with object proposal priors
Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of The 6th Conference on Robot Learning (CoRL), volume 205, pages 1199–1210. PMLR, 2023
APPENDIX: Please check out our website (https://umi-gripper.github.io) for additional results and comparisons. In...
-
[56]
Camera Latency Measurement: For policy observation across both the UR5 and Franka FR2 platforms, we equip each robot arm with a single wrist-mounted GoPro Hero 9 camera. To obtain real-time video streams from the GoPro, we use a combination of a GoPro Media Mod 1.0 (to convert USB-C to HDMI) and an Elgato HD60X external capture card (to convert HDMI to USB-3...
-
[57]
Proprioception Latency Measurement: When the robotic hardware directly reports global timestamps, as is the case for the Franka FR2 robot, we measure the proprioception latency by subtracting the robot sending timestamp t_robot from the policy-received timestamp t_recv: l_obs = t_recv - t_robot. When the robotic hardware timestamp is unavailable, such as the UR5 ...
-
[58]
Gripper Execution Latency Measurement: To obtain the gripper execution latency l_action, we subtract the proprioception latency l_obs from the end-to-end latency l_e2e. To measure l_e2e, we send a sequence of sinusoidal position commands to the gripper, and then record a sequence of gripper width perceptions. The l_e2e can be obtained by computing the optim... (a sketch of this alignment appears after this reference list)
-
[59]
Robot Execution Latency Measurement: Similar to the gripper, we also measure the execution latency of the robot (either UR5 or Franka) by calculating l_e2e as the optimal alignment between a sequence of desired end-effector poses and the measured actual end-effector poses. Due to safety concerns, we directly teleoperate the robot to generate the desired e...
-
[60]
Initial State Selection: For all tasks, we manually select a set of initial states with diverse pose coverage across task scenes (for both the robot and the environment) that are shared across all evaluated methods. During evaluation, we manually match the initial states with a third-person camera to be close to pixel-perfect. We ensure the initial state...
-
[61]
An evaluation episode can be terminated due to: • Safety Concern
Termination Criteria: During evaluation, an operator supervises the robot at all times. An evaluation episode can be terminated due to: • Safety Concern. When the operator deems the robot is about to perform dangerous actions that could potentially break the setup/robot or do any other harm, the episode will be terminated immediately. • Robot Fault. When...
-
[62]
Success Criteria: It is difficult to define automatic and compact success metrics for complex manipulation tasks reported in this paper. Therefore, the operator manually judges the success or failure of each episode using the rubric described below. While we try to create a concise and objective rubric, it inevitably contains subjective elements. As...
-
[63]
We found this feature to significantly increase mapping robustness in-the-wild
with known sizes to disambiguate possible explanations of feature matches. We found this feature to significantly increase mapping robustness in-the-wild. Note that demonstration videos will not contain these fiducial markers; they are only used for mapping. E. Policy Implementation Details: We use Diffusion Policy [9] for all tasks. Detailed hyperpara...
-
[64]
Notably, the dataset collected for each task lacks the scale required for training ViT from scratch
Vision encoder: We utilize the Vision Transformer (ViT) [11] as the vision encoder due to its substantial capacity in comparison to ResNet [17], which proves crucial for tasks demanding intricate perceptual capabilities. Notably, the dataset collected for each task lacks the scale required for training ViT from scratch. To address this limitation, we e... (a sketch of a pretrained ViT encoder appears after this reference list)
-
[65]
Frequency: For most quasi-static tasks, a frequency of 10Hz proves sufficient for both observation and action. However, a frequency of 20Hz is employed for the dynamic tossing task, which requires highly reactive behaviors
-
[66]
However, during execution, we are not bound to follow the same dt
Speed: The output of Diffusion Policy is a sequence of actions, specifically the target pose, with an implicit dt_output between two steps determined by the demonstration dataset. However, during execution, we are not bound to follow the same dt. By adjusting dt_execution, we can achieve different execution speeds compared to the human demonstration. I... (a sketch of this retiming appears after this reference list)
-
[67]
Image Augmentation: We employ a set of image augmentations to enhance the diversity of our training data, thereby improving the robustness and generalization capabilities of our policy. The augmentation pipeline includes a RandomCrop operation with a ratio of 0.95, a RandomRotation operation with degrees ranging from -5.0 to 5.0, and a ColorJitter ... (a sketch of this pipeline appears after this reference list)
-
[68]
Soft Compliant Fingers: We used the same soft fingers on both UMI data collection grippers and deployed robotic grippers. Printed with 95A TPU material, the rib-like pattern on the finger maintains rigidity on the fingertip while conforming to the object geometry for a more secure grasp (Fig. A3). When deployed to robots that lack force-torque co...
-
[69]
Franka Mount: Due to FR2's limited end-effector pitch (FR2 is designed for top-down pick and place, while the UMI gripper is mostly held horizontally), we had to design and 3D print a custom mounting adapter that rotates the WSG50 gripper 90 degrees with respect to the robot's end-effector flange
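The "optimal alignment" in fragments [58] and [59] amounts to finding the time shift that best matches a commanded signal to its measured echo. A minimal sketch under that reading, with the sample period and search window as assumptions:

```python
import numpy as np

def estimate_latency(commanded: np.ndarray, measured: np.ndarray,
                     dt: float, max_lag_s: float = 0.5) -> float:
    """Estimate end-to-end latency l_e2e as the lag (in seconds) that
    minimizes the mean squared error between the commanded sequence and
    the lag-shifted measured sequence (fragments [58]-[59])."""
    max_lag = int(max_lag_s / dt)

    def alignment_error(lag: int) -> float:
        # compare the command at step t with the measurement at t + lag
        c = commanded[: len(commanded) - lag]
        m = measured[lag:]
        return float(np.mean((c - m) ** 2))

    best_lag = min(range(max_lag + 1), key=alignment_error)
    return best_lag * dt

# usage: a 1 Hz sinusoidal command sampled at 100 Hz, echoed 80 ms late
t = np.arange(0.0, 5.0, 0.01)
print(estimate_latency(np.sin(2 * np.pi * t),
                       np.sin(2 * np.pi * (t - 0.08)), dt=0.01))  # ~0.08
```

Per fragment [58], the gripper execution latency then follows as l_action = l_e2e - l_obs.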
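Fragment [64] is cut off before naming the pretrained weights it adopts, but the general recipe (a pretrained ViT reused as the observation encoder because per-task data is too small to train one from scratch) could be sketched as follows; the timm checkpoint id is an assumption, not the paper's configuration.

```python
import timm
import torch

# Assumed checkpoint: a CLIP-pretrained ViT-Base/16; the fragment above
# is truncated before naming the actual weights used.
encoder = timm.create_model(
    "vit_base_patch16_clip_224.openai",
    pretrained=True,
    num_classes=0,  # drop the classifier head; emit pooled features
)
frame = torch.randn(1, 3, 224, 224)   # one normalized wrist-camera frame
with torch.no_grad():
    obs_embedding = encoder(frame)    # (1, 768) feature vector
print(obs_embedding.shape)
```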
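Fragment [66]'s point is that the waypoint spacing at execution time is a free knob. A minimal sketch with illustrative names:

```python
def execution_dt(dt_output: float, speedup: float) -> float:
    """dt_output is the implicit step spacing of the policy's action
    sequence (inherited from the demonstration data); executing the same
    waypoints with a smaller spacing replays the motion faster."""
    return dt_output / speedup

# demo-rate actions at 10 Hz (dt_output = 0.1 s), replayed 1.5x faster:
print(execution_dt(0.1, 1.5))  # ~0.0667 s between waypoints
```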
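Fragment [67] lists the augmentation pipeline but is truncated mid-way; a rough torchvision sketch under one reading ("RandomCrop with a ratio of 0.95" taken as cropping 95% of each side), with the ColorJitter values as placeholders:

```python
import torchvision.transforms as T

SRC = 224                 # assumed input resolution
CROP = int(0.95 * SRC)    # one reading of "RandomCrop with a ratio of 0.95"

augment = T.Compose([
    T.RandomCrop(CROP),
    T.Resize(SRC),
    T.RandomRotation(degrees=(-5.0, 5.0)),
    # ColorJitter parameters are truncated in the fragment; placeholders:
    T.ColorJitter(brightness=0.3, contrast=0.4, saturation=0.5, hue=0.08),
])
```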