
arXiv: 2507.00990 · v3 · pith:RQZ3TJEJnew · submitted 2025-07-01 · 💻 cs.RO · cs.AI · cs.CV

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Pith reviewed 2026-05-19 06:32 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords robotic manipulation · imitation learning · video generation · diffusion models · 6D pose tracking · vision-language models · trajectory retargeting

The pith

Robots achieve manipulation performance matching real demonstrations by imitating filtered AI-generated videos

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a system that lets robots carry out tasks such as pouring, wiping, and mixing by following videos created by a diffusion model instead of recordings of human actions. A language command and initial scene image prompt the model to produce candidate videos; a vision-language model discards those that fail to match the command. A 6D pose tracker then recovers object motion paths from the retained videos, and these paths are mapped to the robot through an embodiment-agnostic retargeting step. Real-robot experiments establish that the filtered generated videos produce success rates comparable to real demonstrations and that success rises when the generated videos are of higher quality. The same pipeline also surpasses alternatives that rely on vision-language models for direct keypoint prediction or on dense feature tracking for trajectory recovery.

Core claim

By generating demonstration videos with an off-the-shelf diffusion model, automatically filtering them with a vision-language model, extracting 6D object trajectories, and retargeting those trajectories to the robot, the method produces real-world manipulation performance that equals the performance obtained from genuine human demonstrations, with effectiveness increasing as video generation quality improves.

What carries the argument

The RIGVid pipeline that turns a language command and scene image into filtered generated videos, extracts object trajectories via 6D pose tracking, and retargets the trajectories to the robot in an embodiment-agnostic manner.
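
As a rough illustration of how the four stages chain together, the sketch below wires them up as plain Python callables. The names `generate_video`, `passes_filter`, `track_object_poses`, and `retarget`, and the `max_candidates` retry budget, are hypothetical placeholders rather than anything taken from the paper, which this page does not quote on how many candidates are sampled or in what order.

```python
from typing import Any, Callable, Optional


def imitate_generated_video(
    command: str,
    scene_image: Any,
    generate_video: Callable[[str, Any], Any],
    passes_filter: Callable[[Any, str], bool],
    track_object_poses: Callable[[Any], list],
    retarget: Callable[[list], list],
    max_candidates: int = 5,  # hypothetical retry budget
) -> Optional[list]:
    """Return a robot trajectory from the first candidate video that passes the filter."""
    for _ in range(max_candidates):
        # Stage 1: generate a candidate demonstration video.
        video = generate_video(command, scene_image)
        # Stage 2: discard videos that do not follow the command.
        if not passes_filter(video, command):
            continue
        # Stage 3: recover the object's 6D pose trajectory.
        object_poses = track_object_poses(video)
        # Stage 4: map the object trajectory onto the robot.
        return retarget(object_poses)
    # No candidate passed the filter within the budget.
    return None
```

Passing the stages in as callables keeps the sketch independent of any particular video model, VLM, or tracker.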

If this is right

  • Filtered generated videos achieve performance equivalent to real demonstrations across real-world evaluations of pouring, wiping, and mixing.
  • Robot success rates increase as the quality of the generated videos improves.
  • Generated videos outperform more compact alternatives such as keypoint prediction using vision-language models.
  • Strong 6D pose tracking yields better trajectory extraction than dense feature point tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the expense of collecting robot training data by substituting synthetic videos for physical recordings.
  • Continued improvement in video generation models would likely expand the range of tasks that can be taught this way without any additional real-world data.
  • Because retargeting does not depend on robot embodiment, the same generated videos might be reused across different robot hardware platforms.

Load-bearing premise

The 6D pose tracker extracts reliable object trajectories from the generated videos and these trajectories can be retargeted to the robot without embodiment-specific failures or safety issues.

What would settle it

A side-by-side real-robot experiment in which success rates with filtered generated videos fall substantially below those with real demonstrations, or in which 6D tracking errors produce visibly incorrect trajectories that cause task failure.

Figures

Figures reproduced from arXiv: 2507.00990 by Hanlin Mai, Shivansh Patel, Shraddhaa Mohan, Svetlana Lazebnik, Unnat Jain, Yunzhu Li.

Figure 1: RIGVid overview. Given an initial scene image and depth, we generate a video conditioned on a language command. A VLM-based automatic filtering step (not shown) can be used to reject videos that fail to follow the prompt. A monocular depth estimator recovers depth for each frame of the generated video, and these depth maps are combined with the corresponding RGB frames to produce 6D Object Pose Trajectory.…
Figure 2: Re-targeting RIGVid to a robot trajectory. Assuming a fixed transformation between the end-effector and the object after grasping, the 6D Object Pose Trajectory (orange arrow) is re-targeted to the robot (blue arrow). This formulation is embodiment agnostic and can be transferred to a different robot. … to the gripper at the moment it is grasped and (2) the offset between the gripper and the robot’s end-e…
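
The fixed-transform assumption in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: poses are assumed to be 4x4 homogeneous matrices stored as nested lists, the first object pose is assumed to be the pose at the moment of grasping, and the function names are hypothetical.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def retarget_to_end_effector(object_poses, ee_pose_at_grasp):
    """Map world-frame object poses to world-frame end-effector poses.

    Assumes the object-to-end-effector transform stays fixed after grasping
    and that object_poses[0] is the object pose at the grasp moment.
    """
    object_to_ee = matmul4(invert_rigid(object_poses[0]), ee_pose_at_grasp)
    return [matmul4(pose, object_to_ee) for pose in object_poses]
```

Because only the object's pose and one grasp-time end-effector pose enter the computation, nothing here depends on the arm's kinematics, which is one way to read the caption's "embodiment agnostic" claim.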
Figure 3: RIGVid is robust to perturbations. A human pushes the robot during execution (image 1), causing the object to deviate from the planned trajectory. When the deviation is detected (image 2), the robot backtracks to the last successfully executed trajectory point (image 3) and then resumes the planned motion (image 4). 3.4. Closed Loop Execution A core strength of our approach is its ability to operate in a c…
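
The backtrack-and-resume behaviour described in the Figure 3 caption might look roughly like the step function below. It is a simplified sketch under stated assumptions: only the object's position is compared (not its full 6D pose), the deviation test is a plain Euclidean distance against a hypothetical `tolerance`, and the function name is invented for illustration.

```python
import math


def step_with_backtracking(planned, observed, target_index, last_good_index, tolerance):
    """Decide which waypoint to command next.

    planned: list of (x, y, z) object positions along the planned trajectory.
    observed: tracked (x, y, z) object position after moving toward planned[target_index].
    Returns (next_target_index, updated_last_good_index).
    """
    deviation = math.dist(planned[target_index], observed)
    if deviation > tolerance:
        # The object left the planned path: go back to the last waypoint
        # that was executed successfully, then resume from there.
        return last_good_index, last_good_index
    # The waypoint was reached: record it and advance (clamped at the end).
    next_index = min(target_index + 1, len(planned) - 1)
    return next_index, target_index
```

Calling this once per control cycle with the tracker's latest estimate gives the detect, backtrack, resume sequence shown in the four images.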
Figure 4: Evaluation tasks. We evaluate RIGVid on everyday manipulation tasks of varying difficulty. During deployment, the system continuously tracks the object’s 6D pose in real time using FoundationPose to update the robot’s end-effector trajectory as the task progresses. This feedback allows the robot to dynamically adjust its motions: if the object deviates from the planned trajectory due to external perturba…
Figure 5: Qualitative comparison of video generation for three models. Sora (top) drastically alters the scene layout and object size. Kling v1.5 (middle) does not fully follow the prompt (water not poured over the plant) and exhibits physically implausible behaviors (water pouring out of the top of the kettle but not the spout). Kling v1.6 (bottom) produces the most consistent and realistic result. … to physically im…
Figure 6: Filtering statistics. Kling V1.6 videos have the highest pass rate, demonstrating more accurate adherence to language commands. What are the filtering statistics for different video generation models? Confirming the trends described above, …
Figure 7: RIGVid performance vs. video quality. The dashed lines separate performance on generated videos from real videos. Kling V1.6 produces the most reliable videos and leads to the highest RIGVid success. Filtered videos perform on par with real ones. UF denotes unfiltered and F denotes filtered. …ing suggests that, at current model quality, generated videos are already sufficient for visual imitation, substantially re…
Figure 8: RIGVid vs. ReKep Success Rates. RIGVid outperforms SOTA VLM-based trajectory prediction method ReKep. …ence time. We take the state-of-the-art ReKep [49] method as a representative of this line of work, and compare against it in …
Figure 9: Comparative evaluation of trajectory extraction methods. RIGVid consistently achieves higher success rates across all four tasks; relative improvements are higher as tasks become harder (i.e., from left to right). … no other way to get the goal image, we set it to the last frame of the generated video. Using only this pair of images, Track2Act uses a learned model to predict a dense grid of 2D point tracks,…
Figure 10: Analyzing intermediate visual representations. Only Gen2Act and our 6D Object Pose Trajectory can correctly track the position and rotation of the watering can, leading to a successful execution. Check the description in the main paper for detailed discussions of the failure modes of the alternative methods.
Figure 11: RIGVid’s embodiment-agnostic capabilities and examples on solving complex, open-world tasks. RIGVid can readily work on ALOHA setup [134] as shown on top left. On the bottom left, RIGVid is retargeted to the bimanual ALOHA setup. On the right, it generates trajectories for diverse manipulation tasks, including wiping, mixing, and ironing, without using any physical demonstrations. …sulting in faulty object l…
Figure 12: Examples of prompting GPT o1 to filter generated videos. We sample frames from the generated video and prompt GPT o1 to assess whether the specified task is performed successfully in the video. The top example passes the filtering, while the bottom does not. … from RGBD observations. For the pouring task, we evaluate our method using trajectories obtained via BundleSDF over 10 trials and observe a success…
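
The frame-sampling step mentioned in the Figure 12 caption could be as simple as picking evenly spaced frame indices before sending the frames to the VLM. The caption does not say how many frames are sampled or how they are spaced, so both the even spacing and the function name below are assumptions.

```python
def sample_frame_indices(num_frames: int, num_samples: int) -> list:
    """Pick up to num_samples evenly spaced frame indices, including first and last."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        # Fewer frames than requested samples: use every frame.
        return list(range(num_frames))
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

The selected frames would then be attached to a prompt asking whether the commanded task is completed; that prompt and the model call are outside this sketch.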
Figure 13: ReKep’s output for the pouring task and the resulting robot execution (top-right). The VLM predicts to grasp at keypoint 1, move keypoint 8 above 15 and 7 during transport, and above 15 and 4 for pouring, leading to failed execution.
Figure 14: Examples of ReKep’s Keypoint Locations. The keypoint placements are often suboptimal, except for the sweeping task, where the keypoints are reasonable.
Figure 15: Gen2Act with BootsTAP, CoTracker, and RIGVid. Blue points denote the tracked points used for PnP; red points represent the reprojected 3D points. For a good PnP solution, these should align, as seen in the first frame. For Gen2Act, the blue points drift significantly from the red ones in later frames, indicating failure in pose estimation due to tracking loss, which leads to failed robot execution. … duri…
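
The blue-versus-red comparison in the Figure 15 caption amounts to a reprojection error. The sketch below computes it under a simple pinhole camera model; the intrinsics `fx`, `fy`, `cx`, `cy` and the function names are illustrative assumptions, and the 3D points are assumed to be already expressed in the camera frame (that is, after the estimated pose has been applied).

```python
import math


def project_pinhole(point_cam, fx, fy, cx, cy):
    """Project a camera-frame 3D point (x, y, z) to pixel coordinates."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)


def mean_reprojection_error(points_cam, tracked_pixels, fx, fy, cx, cy):
    """Mean pixel distance between tracked 2D points and reprojected 3D points."""
    if not points_cam or len(points_cam) != len(tracked_pixels):
        raise ValueError("need equally sized, non-empty point lists")
    errors = [
        math.dist(project_pinhole(p, fx, fy, cx, cy), q)
        for p, q in zip(points_cam, tracked_pixels)
    ]
    return sum(errors) / len(errors)
```

A value that grows over successive frames would correspond to the drift the caption describes for the point-tracking baselines.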
Figure 16: Additional examples of RIGVid’s robustness. In the top row, RIGVid recovers from a faulty initial grasp by reorienting the object before continuing execution. In the bottom row, it corrects for external disturbances on the object when a human pushes it mid-execution, realigning and successfully completing the task. I. Errors from Depth Estimation [plot label: Generated Videos with Predicted Depth Re…]
Figure 18: Errors in Monocular Depth Estimation. In the generated video (top), the depth of the spatula changes only slightly despite a large visual change. In the real video (bottom), the spatula’s head is predicted to lie farther away, contradicting the visual appearance. … changes from 40.1 cm to 38.2 cm, a 1.9 cm difference over just 0.066 seconds, which is physically implausible for the generated video. We find s…
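
To put the quoted numbers in perspective: a 1.9 cm change over 0.066 s implies a speed of roughly 28.8 cm/s along the camera axis. The snippet below reproduces that arithmetic and shows how such jumps could be flagged; the `max_speed_cm_per_s` threshold is a hypothetical parameter, not a value from the paper.

```python
def implied_speed_cm_per_s(depth_before_cm, depth_after_cm, elapsed_s):
    """Speed along the camera axis implied by two depth readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return abs(depth_after_cm - depth_before_cm) / elapsed_s


def is_depth_jump(depth_before_cm, depth_after_cm, elapsed_s, max_speed_cm_per_s):
    """Flag a depth change whose implied speed exceeds a chosen (hypothetical) limit."""
    speed = implied_speed_cm_per_s(depth_before_cm, depth_after_cm, elapsed_s)
    return speed > max_speed_cm_per_s


# The figures quoted above: 40.1 cm to 38.2 cm in 0.066 s.
speed = implied_speed_cm_per_s(40.1, 38.2, 0.066)
```

Whether a given speed is implausible depends on how fast the object visibly moves in that part of the video, so any threshold would need to be set per task.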
Figure 19: Flickering in Depth Prediction. We show three consecutive frames of the video and its corresponding predicted depth. The depth of the watering can changes noticeably across frames, appearing significantly whiter in the third frame despite minimal actual motion. We observe this behavior in both generated and real videos. …aged over ten pouring trajectories from generated videos. MegaPose yields an average tr…
Figure 20: Qualitative Comparison of Different Video Generative Models. Videos from the three video generation models are shown using evenly sampled frames, along with VBench++ [50] metrics: video-text consistency, image-to-video subject consistency, and subject consistency. Kling v1.6 scores highest on these metrics, followed by Kling v1.5 and then Sora.
Figure 21: Qualitative comparison of video generation. Sora-generated videos often alter the scene layout and objects. Kling V1.5 produces more plausible results but includes physically implausible elements. Kling V1.6 better preserves scene fidelity and closely follows the human command.
Original abstract

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RIGVid, a system for robotic manipulation that generates candidate demonstration videos using an off-the-shelf video diffusion model conditioned on a language command and initial scene image, filters them via a VLM to retain only those matching the command, extracts 6D object trajectories with a pose tracker, and retargets those trajectories to the robot in an embodiment-agnostic manner. Through real-world experiments on tasks such as pouring, wiping, and mixing, it claims that filtered generated videos achieve performance comparable to real physical demonstrations, that results improve with higher generation quality, and that the approach outperforms alternatives such as VLM keypoint prediction or dense feature tracking.

Significance. If the empirical claims hold under detailed scrutiny, the work would be significant for demonstrating that synthetic video generation can serve as a viable, scalable substitute for physical demonstration collection in imitation learning. This could lower the cost and hardware requirements for training complex manipulation policies and highlight the utility of combining generative models with classical tracking and retargeting pipelines.

major comments (2)
  1. [§4] §4 (Real-world Evaluations): The central claim that filtered generated videos are 'as effective as real demonstrations' is load-bearing for the contribution, yet the manuscript provides no quantitative success rates, trial counts, error bars, or statistical tests comparing the two conditions. Without these data, the equivalence result and the statement that 'performance improves with generation quality' cannot be assessed for robustness or effect size.
  2. [§3.2] §3.2 (6D Pose Tracking): The method assumes that a standard 6D pose tracker can extract accurate, low-jitter trajectories from diffusion-generated videos despite potential frame-to-frame geometric or lighting inconsistencies. No ablation or metric (e.g., mean pose error or tracking success rate) is reported comparing tracker output on generated versus real videos; if tracking noise is materially higher on generated footage, downstream retargeting would introduce errors absent from the real-demonstration baseline, undermining the equivalence result.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'embodiment-agnostic fashion' for retargeting is used without a short description of the mapping procedure; adding one sentence would improve accessibility.
  2. [Figure 3] Figure 3 or equivalent experimental figure: Captions should explicitly state the number of trials per condition and whether success is defined by task completion within a time limit or by a distance threshold.
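
To make concrete why the report asks for trial counts and error bars, the sketch below computes a Wilson score interval for a success rate. It is an illustration only: the choice of the Wilson interval is ours, not the referee's or the authors', and the example count of 8 successes in 10 trials is hypothetical, not a result from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 8 successes out of 10 trials.
low, high = wilson_interval(8, 10)
```

With so few trials the interval spans roughly 0.49 to 0.94, which shows why two conditions with similar point estimates cannot be called equivalent without the underlying counts.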

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback, which identifies key areas where additional quantitative details can strengthen the presentation of our empirical results. We address each major comment below and commit to revisions that improve clarity and rigor without altering the core claims.

Point-by-point responses
  1. Referee: [§4] §4 (Real-world Evaluations): The central claim that filtered generated videos are 'as effective as real demonstrations' is load-bearing for the contribution, yet the manuscript provides no quantitative success rates, trial counts, error bars, or statistical tests comparing the two conditions. Without these data, the equivalence result and the statement that 'performance improves with generation quality' cannot be assessed for robustness or effect size.

    Authors: We agree that explicit quantitative comparisons are essential for substantiating the central claim. While the manuscript describes real-world experiments on pouring, wiping, and mixing and states that filtered generated videos achieve comparable performance, we acknowledge that success rates, trial counts, error bars, and statistical tests are not presented in tabular or statistical form in the main text. In the revised version, we will add a new table in §4 reporting per-task success rates (as percentages) for both generated-video and real-demonstration conditions, along with the number of trials conducted (typically 10 per condition), standard deviations, and results of statistical tests (e.g., two-sample t-tests or Wilcoxon rank-sum tests) to evaluate equivalence. We will also include results from videos generated at different quality levels to support the claim that performance improves with generation quality. These additions will allow readers to assess effect sizes and robustness directly. revision: yes

  2. Referee: [§3.2] §3.2 (6D Pose Tracking): The method assumes that a standard 6D pose tracker can extract accurate, low-jitter trajectories from diffusion-generated videos despite potential frame-to-frame geometric or lighting inconsistencies. No ablation or metric (e.g., mean pose error or tracking success rate) is reported comparing tracker output on generated versus real videos; if tracking noise is materially higher on generated footage, downstream retargeting would introduce errors absent from the real-demonstration baseline, undermining the equivalence result.

    Authors: We appreciate the referee’s point that direct validation of the pose tracker on generated videos is necessary to rule out confounding tracking errors. The current manuscript relies on an off-the-shelf 6D pose tracker and reports strong end-to-end task performance, but does not provide separate tracking-quality metrics. In revision we will add an ablation subsection (or appendix) that reports mean pose error, frame-to-frame jitter, and tracking success rate on a held-out set of both generated and real videos for the evaluated tasks. If materially higher noise is observed on generated footage, we will discuss any smoothing or filtering steps applied during retargeting and quantify their effect on final policy performance. This will clarify whether the equivalence result holds independently of tracking differences. revision: yes
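
One possible reading of the "frame-to-frame jitter" metric mentioned in the response is the mean magnitude of second differences of the tracked position, which is zero for constant-velocity motion and grows with high-frequency noise. This definition is an assumption made here for illustration; neither the referee nor the simulated rebuttal specifies one, and a full version would also cover rotation.

```python
import math


def translation_jitter(positions):
    """Mean norm of second differences over a sequence of (x, y, z) positions.

    Returns 0.0 for fewer than three samples, and 0.0 for perfectly
    constant-velocity motion.
    """
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for a, b, c in zip(positions, positions[1:], positions[2:]):
        # Discrete second difference: a - 2b + c, per axis.
        second_diff = [a[i] - 2 * b[i] + c[i] for i in range(3)]
        total += math.sqrt(sum(v * v for v in second_diff))
    return total / (len(positions) - 2)
```

Computing this on tracker output from generated and real videos of the same task would give one number per video to compare across the two conditions.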

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external validation

full rationale

The paper presents an empirical system (RIGVid) that uses off-the-shelf video diffusion models to generate candidate demonstrations from language and scene input, applies VLM filtering, extracts trajectories via 6D pose tracking, and retargets them for robot execution. Real-world evaluations directly compare success rates against real physical demonstrations and alternative trajectory extraction methods (e.g., keypoint prediction, dense feature tracking). No mathematical derivations, equations, or fitted parameters are described that reduce any claimed result to inputs defined by the same data. The central claim of equivalence to real demos rests on external, falsifiable robot trials rather than self-referential definitions or self-citation chains. Self-citations, if present for the diffusion or tracking modules, are not load-bearing for the equivalence result. This is a standard empirical robotics paper whose performance claims are independently testable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that current video diffusion models produce videos whose object motions are physically plausible enough for a 6D tracker to extract usable trajectories, plus the assumption that VLM filtering reliably removes invalid videos.

axioms (2)
  • domain assumption: Generated videos contain extractable 6D object trajectories that correspond to feasible real-world actions.
    Invoked when the 6D pose tracker is applied to synthetic video output.
  • domain assumption: VLM-based filtering selects videos that are both command-compliant and robot-executable.
    Central to the claim that filtered videos match real demonstrations.

pith-pipeline@v0.9.0 · 5742 in / 1276 out tokens · 22559 ms · 2026-05-19T06:32:16.009241+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PlayWorld: Learning Robot World Models from Autonomous Play

    cs.RO 2026-03 unverdicted novelty 7.0

    PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...

  2. Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

    cs.RO 2026-05 unverdicted novelty 6.0

    Imagine2Real enables zero-shot humanoid-object interaction by unifying motions as 4D point trajectories, tracking only base/hands/object keypoints inside a BFM latent space, and training with progressive simple reward...

  3. AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs

    cs.RO 2026-05 unverdicted novelty 6.0

    AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with d...

  4. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  5. Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

    cs.RO 2026-05 unverdicted novelty 5.0

    Imagine2Real is a zero-shot humanoid-object interaction method that unifies robot and object motion as 4D point trajectories, tracks only sparse keypoints inside a behavior foundation model latent space, and trains wi...

  6. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  7. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    cs.RO 2025-08 unverdicted novelty 5.0

    This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.

  8. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

140 extracted references · 140 canonical work pages · cited by 7 Pith papers · 16 internal anchors

  1. [1]

    https://www.klingai.com/, 2024

    Kling AI. https://www.klingai.com/, 2024. Accessed: 2024-02-10. 1

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 3

  3. [3]

    Compositional foundation models for hierarchical planning

    Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, and Pulkit Agrawal. Compositional foundation models for hierarchical planning. Advances in Neural Information Processing Systems, 36: 22304–22325, 2023. 3

  4. [4]

    Nil: No-data imitation learning by leveraging pre-trained video diffusion models

    Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, and Michael Black. Nil: No-data imitation learning by leveraging pre-trained video diffusion models. arXiv preprint arXiv:2503.10626, 2025. 3

  5. [5]

    Flowcontrol: Optical flow based visual servoing

    Max Argus, Lukas Hermann, Jon Long, and Thomas Brox. Flowcontrol: Optical flow based visual servoing. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7534–7541. IEEE, 2020. 2

  6. [6]

    Zs6d: Zero-shot 6d object pose estimation using vision transformers

    Philipp Ausserlechner, David Haberger, Stefan Thalhammer, Jean-Baptiste Weibel, and Markus Vincze. Zs6d: Zero-shot 6d object pose estimation using vision transformers. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 463–469. IEEE, 2024. 3

  7. [7]

    Screwmimic: Bimanual imitation from human videos with screw space projection

    Arpit Bahety, Priyanka Mandikal, Ben Abbatematteo, and Roberto Martín-Martín. Screwmimic: Bimanual imitation from human videos with screw space projection. arXiv preprint arXiv:2405.03666, 2024. 1

  8. [8]

    Human-to-robot imitation in the wild

    Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022. 1, 2

  9. [9]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023. 1, 3

  10. [10]

    Video pretraining (vpt): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022. 3

  11. [11]

    VideoPhy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520, 2024. 1

  12. [12]

    Dream to manipulate: Compositional world models empowering robot imitation learning with imagination

    Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination. arXiv preprint arXiv:2412.14957, 2024. 3

  13. [13]

    Zero-shot robot manipulation from passive human videos

    Homanga Bharadhwaj, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Zero-shot robot manipulation from passive human videos. arXiv preprint arXiv:2302.02011, 2023. 1

  14. [14]

    Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024. 1, 2, 3, 8, 18

  15. [15]

    Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation, 2024. 1, 2, 7, 18

  16. [16]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators

  17. [17]

    Reconstruct locally, localize globally: A model free method for object pose estimation

    Ming Cai and Ian Reid. Reconstruct locally, localize globally: A model free method for object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3153–3163, 2020. 3

  18. [18]

    A tutorial on task-parameterized movement learning and retrieval

    Sylvain Calinon. A tutorial on task-parameterized movement learning and retrieval. Intelligent service robotics, 9: 1–29, 2016. 3

  19. [19]

    Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models

    Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models. European Conference on Computer Vision (ECCV), 2024. 3

  20. [20]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021. 6, 21

  21. [21]

    Learning video-conditioned policies for unseen manipulation tasks

    Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Learning video-conditioned policies for unseen manipulation tasks. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 909–916. IEEE,

  22. [22]

    Semantic visual navigation by watching youtube videos

    Matthew Chang, Arjun Gupta, and Saurabh Gupta. Semantic visual navigation by watching youtube videos. In NeurIPS, 2020. 1, 3

  23. [23]

    Goat: Go to any thing

    Matthew Chang, Theophile Gervet, Mukul Khanna, Sriram Yenamandra, Dhruv Shah, So Yeon Min, Kavit Shah, Chris Paxton, Saurabh Gupta, Dhruv Batra, Roozbeh Mottaghi, Jitendra Malik, and Devendra Singh Chaplot. Goat: Go to any thing. arXiv preprint arXiv:2311.06430, 2023. 4

  24. [24]

    Expressive whole-body control for humanoid robots

    Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots. arXiv preprint arXiv:2402.16796,

  25. [25]

    Nonparametric motion retargeting for humanoid robots on shared latent space

    Sungjoon Choi, Matthew KXJ Pan, and Joohyung Kim. Nonparametric motion retargeting for humanoid robots on shared latent space. In Robotics: science and systems,

  26. [26]

    Transformers for one-shot visual imitation

    Sudeep Dasari and Abhinav Gupta. Transformers for one-shot visual imitation. In Conference on Robot Learning, pages 2071–2084. PMLR, 2021. 1, 2

  27. [27]

    An unbiased look at datasets for visuo-motor pre-training

    Sudeep Dasari, Mohan Kumar Srirama, Unnat Jain, and Abhinav Gupta. An unbiased look at datasets for visuo-motor pre-training. In Conference on Robot Learning, pages 1183–1198. PMLR, 2023. 3

  28. [28]

    Bootstap: Bootstrapped training for tracking-any-point

    Carl Doersch, Yi Yang, Dilara Gokay, Pauline Luc, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ross Goroshin, João Carreira, and Andrew Zisserman. Bootstap: Bootstrapped training for tracking-any-point. arXiv preprint arXiv:2402.00847, 2024. 18

  29. [29]

    Video language planning

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023. 3

  30. [30]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024. 1, 3

  31. [31]

    Anygrasp: Robust and efficient grasp perception in spatial and temporal domains

    Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023. 4

  32. [32]

    One-shot visual imitation learning via meta-learning

    Chelsea Finn, Tianhe Yu, T. Zhang, P. Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In CoRL, 2017. 3

  33. [33]

    Dense Object Nets: Learning Dense Visual Object Descriptors By and For Robotic Manipulation

    Peter R Florence, Lucas Manuelli, and Russ Tedrake. Dense object nets: Learning dense visual object descriptors by and for robotic manipulation. arXiv preprint arXiv:1806.08756, 2018. 2

  34. [34]

    Humanplus: Humanoid shadowing and imitation from humans

    Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. arXiv preprint arXiv:2406.10454,

  35. [35]

    Flip: Flow-centric generative planning as general-purpose manipulation world model

    Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. Flip: Flow-centric generative planning for general-purpose manipulation tasks. arXiv preprint arXiv:2412.08261, 2024. 2

  36. [36]

    Adaworld: Learning adaptable world models with latent actions

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938, 2025. 1

  37. [37]

    Navigating to objects in the real world

    Theophile Gervet, Soumith Chintala, Dhruv Batra, Jitendra Malik, and Devendra Singh Chaplot. Navigating to objects in the real world. Science Robotics, 2023. 4

  38. [38]

    Retargetting motion to new characters

    Michael Gleicher. Retargetting motion to new characters. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 33–42, 1998. 3

  39. [39]

    T2VPhysBench: A first-principles benchmark for physical consistency in text-to-video generation

    Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation. arXiv preprint arXiv:2505.00337, 2025. 1

  40. [40]

    Multiple view geometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press,

  41. [41]

    Learning human-to-humanoid real-time whole-body teleoperation

    Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951. IEEE, 2024. 3

  42. [42]

    Onepose++: Keypoint-free one-shot object pose estimation without cad models

    Xingyi He, Jiaming Sun, Yuang Wang, Di Huang, Hujun Bao, and Xiaowei Zhou. Onepose++: Keypoint-free one-shot object pose estimation without cad models. Advances in Neural Information Processing Systems, 35:35103–35115, 2022. 3

  43. [43]

    Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation

    Yisheng He, Wei Sun, Haibin Huang, Jianran Liu, Haoqiang Fan, and Jian Sun. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11632–11641, 2020. 3

  44. [44]

    Ffb6d: A full flow bidirectional fusion network for 6d pose estimation

    Yisheng He, Haibin Huang, Haoqiang Fan, Qifeng Chen, and Jian Sun. Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3003–3013, 2021. 3

  45. [45]

    Fs6d: Few-shot 6d pose estimation of novel objects

    Yisheng He, Yao Wang, Haoqiang Fan, Jian Sun, and Qifeng Chen. Fs6d: Few-shot 6d pose estimation of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6814–6824, 2022. 3

  46. [46]

    Spot: SE(3) pose trajectory diffusion for object-centric manipulation

    Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, and Stan Birchfield. Spot: SE(3) pose trajectory diffusion for object-centric manipulation. arXiv preprint arXiv:2411.00965,

  47. [47]

    Online human walking imitation in task and joint space based on quadratic programming

    Kai Hu, Christian Ott, and Dongheui Lee. Online human walking imitation in task and joint space based on quadratic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 3458–3464. IEEE,

  48. [48]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. 7

  49. [49]

    ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024. 2, 7

  50. [50]

    Vbench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench++: Comprehensive and versatile benchmark suite for video generative models. arXiv preprint arXiv:2411.13503, 2024. 6, 21, 22

  51. [51]

    Motiongpt: Human motion as a foreign language

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36:20067–20079, 2023. 3

  52. [52]

    Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation

    Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, and Huazhe Xu. Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. In European Conference on Computer Vision, pages 222–239. Springer, 2024. 3

  53. [53]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. arXiv preprint arXiv:2410.11831,

  54. [54]

    Language-driven representation learning for robotics

    Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023. 3

  55. [55]

    Egomimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video. arXiv preprint arXiv:2410.24221, 2024. 1, 2

  56. [56]

    Video depth without video models

    Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, and Konrad Schindler. Video depth without video models,

  57. [57]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  58. [58]

    Robot see robot do: Imitating articulated object manipulation with monocular 4d reconstruction

    Justin Kerr, Chung Min Kim, Mingxuan Wu, Brent Yi, Qianqian Wang, Ken Goldberg, and Angjoo Kanazawa. Robot see robot do: Imitating articulated object manipulation with monocular 4d reconstruction. arXiv preprint arXiv:2409.18121, 2024. 2, 3, 8, 18

  59. [59]

    Garfield: Group anything with radiance fields

    Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 18

  60. [60]

    Learning to act from actionless videos through dense correspondences

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023. 2, 3, 8, 18

  61. [61]

    Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot

    Scott Kuindersma, Robin Deits, Maurice Fallon, Andrés Valenzuela, Hongkai Dai, Frank Permenter, Twan Koolen, Pat Marion, and Russ Tedrake. Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Autonomous robots, 40:429–455,

  62. [62]

    Cosypose: Consistent multi-view multi-object 6d pose estimation

    Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 574–591. Springer, 2020. 3

  63. [63]

    Megapose: 6d pose estimation of novel objects via render & compare

    Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022. 3, 4, 20

  64. [64]

    Kinematic motion retargeting for contact-rich anthropomorphic manipulations

    Arjun S Lakshmipathy, Jessica K Hodgins, and Nancy S Pollard. Kinematic motion retargeting for contact-rich anthropomorphic manipulations. arXiv preprint arXiv:2402.04820, 2024. 3

  65. [65]

    Phantom: Training robots without robots using only human videos

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025. 1

  66. [66]

    EPnP: An accurate O(n) solution to the PnP problem

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International journal of computer vision, 81:155–166, 2009. 3

  67. [67]

    Nerf-pose: A first-reconstruct-then-regress approach for weakly-supervised 6d object pose estimation

    Fu Li, Shishir Reddy Vutukur, Hao Yu, Ivan Shugurov, Benjamin Busam, Shaowu Yang, and Slobodan Ilic. Nerf-pose: A first-reconstruct-then-regress approach for weakly-supervised 6d object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2123–2133, 2023. 3

  68. [68]

    One-shot open affordance learning with foundation models

    Gen Li, Deqing Sun, Laura Sevilla-Lara, and Varun Jampani. One-shot open affordance learning with foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3086–3096, 2024. 3

  69. [69]

    Learning precise affordances from egocentric videos for robotic manipulation

    Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, and Laura Sevilla-Lara. Learning precise affordances from egocentric videos for robotic manipulation. arXiv preprint arXiv:2408.10123, 2024. 3

  70. [70]

    Okami: Teaching humanoid robots manipulation skills through single video imitation

    Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, and Yuke Zhu. Okami: Teaching humanoid robots manipulation skills through single video imitation. In 8th Annual Conference on Robot Learning,

  71. [71]

    Amt: All-pairs multi-field transforms for efficient frame interpolation

    Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 21

  72. [72]

    Dreamitate: Real-world visuomotor policy learning via video generation

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. arXiv preprint arXiv:2406.16862,

  73. [73]

    Dynamic movement primitive based motion retargeting for dual-arm sign language motions

    Yuwei Liang, Weijie Li, Yue Wang, Rong Xiong, Yichao Mao, and Jiafan Zhang. Dynamic movement primitive based motion retargeting for dual-arm sign language motions. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 8195–8201. IEEE, 2021. 3

  74. [74]

    Reconx: Reconstruct any scene from sparse views with video diffusion model

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. arXiv preprint arXiv:2408.16767, 2024. 1

  75. [75]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 4

  76. [76]

    Unopose: Unseen object pose estimation with an unposed rgb-d reference image

    Xingyu Liu, Gu Wang, Ruida Zhang, Chenyangguang Zhang, Federico Tombari, and Xiangyang Ji. Unopose: Unseen object pose estimation with an unposed rgb-d reference image. arXiv preprint arXiv:2411.16106, 2024. 3

  77. [77]

    Imitation from observation: Learning to imitate behaviors from raw video via context translation

    YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1118–1125. IEEE, 2018. 3

  78. [78]

    Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images

    Yuan Liu, Yilin Wen, Sida Peng, Cheng Lin, Xiaoxiao Long, Taku Komura, and Wenping Wang. Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In European Conference on Computer Vision, pages 298–315. Springer, 2022. 3

  79. [79]

    Perpetual humanoid control for real-time simulated avatars

    Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023. 3

  80. [80]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–

Showing first 80 references.