MonoDuo: Using One Robot Arm to Learn Bimanual Policies
Pith reviewed 2026-06-29 07:19 UTC · model grok-4.3
The pith
Single-arm robot demonstrations paired with humans can train bimanual policies that transfer zero-shot to real two-arm robots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MonoDuo collects paired single-arm robot and human data for bimanual tasks, converts it into synthetic demonstrations for target bimanual robots through hand-pose estimation, image and point-cloud segmentation, and inpainting, and trains policies on these demonstrations that support zero-shot deployment on unseen bimanual configurations with success rates up to 70 percent and substantial gains from few-shot finetuning.
What carries the argument
The synthetic demonstration generation pipeline that augments single-arm robot plus human collaboration data into kinematically grounded bimanual demonstrations for the target robot.
If this is right
- Bimanual policies can be trained without any real two-arm robot data and deployed directly on new robot hardware.
- Twenty-five real demonstrations from the target robot produce large performance gains over training from scratch.
- The approach covers tasks such as box lifting, backpack packing, cloth folding, jacket zipping, and plate handover.
- Single-arm robots already present in labs become a practical data source for bimanual skill learning.
Where Pith is reading between the lines
- The same single-arm plus human collection pattern could extend to multi-robot coordination tasks beyond two arms.
- Improving the accuracy of the hand-pose and inpainting steps would likely raise zero-shot success rates further.
- The method points toward hybrid human-robot data pipelines that reduce dependence on scarce multi-robot hardware.
- Testing the pipeline on robots with very different kinematics from the source arm would reveal the limits of the synthetic transfer.
Load-bearing premise
The vision-based steps that create synthetic bimanual demonstrations from single-arm and human data preserve the necessary movement constraints so that policies transfer to the real target robot.
What would settle it
Policies trained solely on the synthetic data achieve zero success on the physical bimanual robot while policies trained from scratch on real bimanual data succeed, even after the few-shot stage.
Figures
read the original abstract
Bimanual coordination is essential for many real-world manipulation tasks, yet learning bimanual robot policies is limited by the scarcity of bimanual robots and datasets. Single-arm robots, however, are widely available in research labs. Can we leverage them to train bimanual robot policies? We present MonoDuo, a framework for learning bimanual manipulation policies using single-arm robot demonstrations paired with human collaboration. MonoDuo collects data by teleoperating a single-arm robot to perform one side of a bimanual task while a human performs the other, then swapping roles to cover both sides. RGB-D observations from a wrist-mounted and fixed camera are augmented into synthetic demonstrations for target bimanual robots using state-of-the-art hand pose estimation, image and point cloud segmentation, and inpainting. These synthetic demonstrations, grounded in real robot kinematics, are used to train bimanual policies. We evaluate MonoDuo on five tasks: box lifting, backpack packing, cloth folding, jacket zipping, and plate handover. Compared to approaches relying solely on human bimanual videos, MonoDuo enables zero-shot deployment on unseen bimanual robot configurations, achieving success rates up to 70%. With only 25 target robot demonstrations, few-shot finetuning further boosts success rates by 65-70% over training from scratch, demonstrating MonoDuo's effectiveness in efficiently transferring knowledge from single-arm robot data to bimanual robot policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MonoDuo, a framework that collects single-arm robot demonstrations paired with human actions for bimanual tasks, then uses hand-pose estimation, segmentation, and inpainting to create synthetic bimanual demonstrations for training policies on target bimanual robots. It evaluates this on five tasks (box lifting, backpack packing, cloth folding, jacket zipping, plate handover) and claims zero-shot success rates up to 70% on unseen bimanual configurations, with few-shot finetuning using 25 demonstrations yielding 65-70% improvements over training from scratch.
Significance. If the synthetic data fidelity holds, the approach could meaningfully reduce reliance on scarce bimanual robot hardware by repurposing widely available single-arm platforms, providing a scalable path to bimanual policy learning. The reported few-shot gains indicate that the generated demonstrations supply a useful inductive bias beyond pure human video data.
major comments (3)
- [Abstract] Abstract: the zero-shot success rates up to 70% and few-shot gains of 65-70% are stated without any mention of the number of trials, evaluation protocol, failure modes, or statistical measures. This information is load-bearing for the central transfer claim, as it is required to assess whether the synthetic demonstrations preserve kinematic and dynamic constraints.
- [Methods] Methods (synthetic data pipeline): no error metrics, ground-truth trajectory comparisons, or ablations are reported for the hand-pose estimation, segmentation, and inpainting steps. Systematic errors in joint angles or contact geometry would directly undermine the zero-shot deployment result on unseen bimanual configurations.
- [Experiments] Experiments: the comparison against human-bimanual-video baselines is asserted but no quantitative tables, success-rate breakdowns per task, or controls isolating the single-arm robot contribution versus synthetic augmentation are provided.
minor comments (1)
- [Abstract] Abstract: the description of the data collection and augmentation pipeline is compressed; separating the method overview from the quantitative claims would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on evaluation clarity and pipeline validation. We address each major comment below, providing clarifications from the manuscript and committing to targeted revisions where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the zero-shot success rates up to 70% and few-shot gains of 65-70% are stated without any mention of the number of trials, evaluation protocol, failure modes, or statistical measures. This information is load-bearing for the central transfer claim, as it is required to assess whether the synthetic demonstrations preserve kinematic and dynamic constraints.
Authors: The abstract is a concise summary; full evaluation details appear in Section 4. Each task used 10 independent trials on the target bimanual setup, with success defined as task completion within time limits without object drops or constraint violations. Means and standard deviations are in Table 1, with qualitative failure mode discussion in Section 4.3. We will revise the abstract to note 'over 10 trials per task' to improve standalone readability. revision: yes
-
Referee: [Methods] Methods (synthetic data pipeline): no error metrics, ground-truth trajectory comparisons, or ablations are reported for the hand-pose estimation, segmentation, and inpainting steps. Systematic errors in joint angles or contact geometry would directly undermine the zero-shot deployment result on unseen bimanual configurations.
Authors: The manuscript prioritizes end-to-end policy transfer as the key validation. We agree intermediate metrics would strengthen the work and will add in revision: hand-pose estimation error against manual annotations on held-out frames, plus an ablation removing each pipeline stage (pose estimation, segmentation, inpainting) and reporting resulting policy success rates. This directly addresses potential systematic errors in kinematics and contacts. revision: yes
-
Referee: [Experiments] Experiments: the comparison against human-bimanual-video baselines is asserted but no quantitative tables, success-rate breakdowns per task, or controls isolating the single-arm robot contribution versus synthetic augmentation are provided.
Authors: Table 1 already reports per-task zero-shot and few-shot success rates for MonoDuo versus human-video baselines across all five tasks. We will expand the table with explicit breakdowns and add a control ablation isolating the single-arm robot grounding by comparing against a human-video-only variant without robot kinematics. This clarifies the contribution of the synthetic pipeline. revision: partial
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents an empirical framework: single-arm robot + human collaboration data is augmented via hand-pose estimation, segmentation, and inpainting to create synthetic bimanual demonstrations, which are then used to train policies evaluated on real tasks. No equations, parameter fits, or self-citations are described as load-bearing steps. Claims rest on measured success rates (zero-shot up to 70%, few-shot gains) rather than any reduction of outputs to inputs by construction. The central assumption about synthetic data fidelity is an empirical prerequisite, not a definitional or fitted tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A system for imitation learning of contact-rich bimanual manipulation policies,
S. Stepputtis, M. Bandari, S. Schaal, and H. B. Amor, “A system for imitation learning of contact-rich bimanual manipulation policies,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , IEEE, 2022, pp. 11 810–11 817
2022
-
[2]
Stabilize to act: Learning to coordinate for bimanual manipulation,
J. Grannen, Y . Wu, B. Vu, and D. Sadigh, “Stabilize to act: Learning to coordinate for bimanual manipulation,” in Conference on Robot Learning, PMLR, 2023, pp. 563–576
2023
-
[3]
Learning fine-grained bimanual manipulation with low-cost hardware,
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” in RSS, 2023
2023
-
[4]
Low-cost exoskeletons for learning whole-arm manipulation in the wild,
H. Fang, H. -S. Fang, Y . Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Low-cost exoskeletons for learning whole-arm manipulation in the wild,” in ICRA, 2023
2023
- [5]
-
[6]
Gello: A general, low-cost, and intuitive teleoperation framework for robot manipu- lators,
P. Wu, Y . Shentu, Z. Yi, X. Lin, and P. Abbeel, “Gello: A general, low-cost, and intuitive teleoperation framework for robot manipu- lators,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , IEEE, 2024, pp. 12 156–12 163
2024
-
[7]
Open teach: A versatile teleoperation system for robotic manipulation,
A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto, “Open teach: A versatile teleoperation system for robotic manipulation,” arXiv preprint arXiv:2403.07870 , 2024
-
[8]
Learning visuotactile skills with two multifingered hands,
T. Lin, Y . Zhang, Q. Li, H. Qi, B. Yi, S. Levine, and J. Ma- lik, “Learning visuotactile skills with two multifingered hands,” arXiv:2404.16823, 2024
-
[9]
Dynamic handover: Throw and catch with bimanual hands,
B. Huang, Y . Chen, T. Wang, Y . Qin, Y . Yang, N. Atanasov, and X. Wang, “Dynamic handover: Throw and catch with bimanual hands,” arXiv preprint arXiv:2309.05655 , 2023
-
[10]
Twisting lids off with two hands,
T. Lin, Z.-H. Yin, H. Qi, P. Abbeel, and J. Malik, “Twisting lids off with two hands,” arXiv:2403.02338, 2024
-
[11]
Sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids,
T. Lin, K. Sachdev, L. Fan, J. Malik, and Y . Zhu, “Sim-to-real reinforcement learning for vision-based dexterous manipulation on humanoids,” arXiv:2502.20396, 2025
-
[12]
Learning by watching: Physical imitation of manipulation skills from human videos,
H. Xiong, Q. Li, Y .-C. Chen, H. Bharadhwaj, S. Sinha, and A. Garg, “Learning by watching: Physical imitation of manipulation skills from human videos,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , IEEE, 2021, pp. 7827–7834
2021
-
[13]
arXiv preprint arXiv:2207.09450 , year=
S. Bahl, A. Gupta, and D. Pathak, “Human-to-robot imitation in the wild,” arXiv preprint arXiv:2207.09450 , 2022
-
[14]
arXiv preprint arXiv:2302.12422 , year=
C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar, “Mimicplay: Long-horizon imitation learning by watching human play,” arXiv preprint arXiv:2302.12422 , 2023
-
[15]
Okami: Teaching humanoid robots manipulation skills through single video imitation,
J. Li, Y . Zhu, Y . Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y . Zhu, “Okami: Teaching humanoid robots manipulation skills through single video imitation,” in 8th Annual Conference on Robot Learning , 2024
2024
-
[16]
Screwmimic: Bimanual imitation from human videos with screw space projection,
A. Bahety, P. Mandikal, B. Abbatematteo, and R. Martín-Martín, “Screwmimic: Bimanual imitation from human videos with screw space projection,” arXiv preprint arXiv:2405.03666 , 2024
-
[17]
Vision-based manipulation from single human video with open-world object graphs,
Y . Zhu, A. Lim, P. Stone, and Y . Zhu, “Vision-based manipulation from single human video with open-world object graphs,” arXiv preprint arXiv:2405.20321, 2024
-
[18]
You only teach once: Learn one-shot bimanual robotic manipulation from video demonstrations,
H. Zhou, R. Wang, Y . Tai, Y . Deng, G. Liu, and K. Jia, “You only teach once: Learn one-shot bimanual robotic manipulation from video demonstrations,” arXiv preprint arXiv:2501.14208 , 2025
-
[19]
Object-centric dexterous manipulation from human motion data,
Y . Chen, C. Wang, Y . Yang, and C. K. Liu, “Object-centric dexterous manipulation from human motion data,” arXiv preprint arXiv:2411.04005, 2024
-
[20]
arxiv preprint arXiv:2403.07788 , year=
C. Wang, H. Shi, W. Wang, R. Zhang, L. Fei-Fei, and C. K. Liu, “Dexcap: Scalable and portable mocap data collection system for dexterous manipulation,” arXiv preprint arXiv:2403.07788 , 2024
-
[21]
A survey of imitation learning: Algorithms, recent developments, and challenges,
M. Zare, P. M. Kebria, A. Khosravi, and S. Nahavandi, “A survey of imitation learning: Algorithms, recent developments, and challenges,” IEEE Transactions on Cybernetics , 2024
2024
-
[22]
Implicit behavioral cloning,
P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson, “Implicit behavioral cloning,” in Conference on robot learning , PMLR, 2022, pp. 158–168
2022
-
[23]
Diffusion policy: Visuomotor policy learning via action diffusion,
C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” in Learning agile robotic locomotion skills by imitating animals, 2023
2023
-
[24]
O. X.-E. Collaboration et al., Open X-Embodiment: Robotic learning datasets and RT-X models , IEEE International Conference on Robotics and Automation, 2024
2024
-
[25]
Multi- embodiment legged robot control as a sequence modeling problem,
C. Yu, W. Zhang, H. Lai, Z. Tian, L. Kneip, and J. Wang, “Multi- embodiment legged robot control as a sequence modeling problem,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 7250–7257
2023
-
[26]
Hardware conditioned policies for multi-robot transfer learning,
T. Chen, A. Murali, and A. Gupta, “Hardware conditioned policies for multi-robot transfer learning,” Advances in Neural Information Processing Systems, vol. 31, 2018
2018
-
[27]
Unigrasp: Learning a unified model to grasp with multifingered robotic hands,
L. Shao, F. Ferreira, M. Jorda, V . Nambiar, J. Luo, E. Solowjow, J. A. Ojea, O. Khatib, and J. Bohg, “Unigrasp: Learning a unified model to grasp with multifingered robotic hands,” IEEE Robotics and Automation Letters , vol. 5, no. 2, pp. 2286–2293, 2020
2020
-
[28]
Adagrasp: Learning an adaptive gripper-aware grasping policy,
Z. Xu, B. Qi, S. Agrawal, and S. Song, “Adagrasp: Learning an adaptive gripper-aware grasping policy,” in 2021 IEEE International Conference on Robotics and Automation (ICRA) , IEEE, 2021, pp. 4620–4626
2021
-
[29]
Nervenet: Learning structured policy with graph neural networks,
T. Wang, R. Liao, J. Ba, and S. Fidler, “Nervenet: Learning structured policy with graph neural networks,” in International conference on learning representations, 2018
2018
-
[30]
Graph networks as learnable physics engines for inference and control,
A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller, R. Hadsell, and P. Battaglia, “Graph networks as learnable physics engines for inference and control,” in Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, Eds., ser. Proceedings of Machine Learning Research, vol. 80, PMLR, Oct. 2018, pp. 4...
2018
-
[31]
Learning to control self-assembling morphologies: A study of generalization via modularity,
D. Pathak, C. Lu, T. Darrell, P. Isola, and A. A. Efros, “Learning to control self-assembling morphologies: A study of generalization via modularity,” Advances in Neural Information Processing Systems , vol. 32, 2019
2019
-
[32]
One policy to control them all: Shared modular policies for agent-agnostic control,
W. Huang, I. Mordatch, and D. Pathak, “One policy to control them all: Shared modular policies for agent-agnostic control,” in International Conference on Machine Learning , PMLR, 2020, pp. 4455–4464
2020
-
[33]
My body is a cage: The role of morphology in graph- based incompatible control,
V . Kurin, M. Igl, T. Rocktaschel, W. Boehmer, and S. Whiteson, “My body is a cage: The role of morphology in graph- based incompatible control,” in Proceedings of the International Conference on Learning Representations, OpenReview, 2021
2021
-
[34]
Jacquard: A large scale dataset for robotic grasp detection,
A. Depierre, E. Dellandréa, and L. Chen, “Jacquard: A large scale dataset for robotic grasp detection,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , IEEE, 2018, pp. 3511–3516
2018
-
[35]
Scalable deep reinforcement learning for vision-based robotic manipulation,
D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on robot learning , PMLR, 2018, pp. 651–673
2018
-
[36]
Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,
S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International journal of robotics research, vol. 37, no. 4-5, pp. 421–436, 2018
2018
-
[37]
ACRONYM: A large-scale grasp dataset based on simulation,
C. Eppner, A. Mousavian, and D. Fox, “ACRONYM: A large-scale grasp dataset based on simulation,” in 2021 IEEE Int. Conf. on Robotics and Automation, ICRA , 2020
2021
- [38]
-
[39]
RH20T: A robotic dataset for learning diverse skills in one-shot,
H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu, “RH20T: A robotic dataset for learning diverse skills in one-shot,” in RSS 2023 Workshop on Learning for Task and Motion Planning , 2023
2023
-
[40]
Bridge data: Boosting generalization of robotic skills with cross-domain datasets,
F. Ebert, Y . Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine, “Bridge data: Boosting generalization of robotic skills with cross-domain datasets,” in Robotics: Science and Systems (RSS) XVIII , 2022
2022
-
[41]
Bridgedata v2: A dataset for robot learning at scale,
H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen- Estruch, A. W. He, V . Myers, M. J. Kim, M. Du, et al., “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning, PMLR, 2023, pp. 1723–1736
2023
-
[42]
Bc-z: Zero-shot task generalization with robotic imitation learning,
E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn, “Bc-z: Zero-shot task generalization with robotic imitation learning,” in Conference on Robot Learning, PMLR, 2022, pp. 991–1002
2022
-
[43]
RT- 1: Robotics transformer for real-world control at scale,
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. , “RT- 1: Robotics transformer for real-world control at scale,” Robotics: Science and Systems (RSS) , 2023
2023
-
[44]
Rt-2: Vision-language-action models transfer web knowledge to robotic control,
B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, PMLR, 2023, pp. 2165–2183
2023
-
[45]
VIMA: General robot manipulation with multimodal prompts,
Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei- Fei, A. Anandkumar, Y . Zhu, and L. Fan, “VIMA: General robot manipulation with multimodal prompts,” International Conference on Machine Learning (ICML) , 2023
2023
-
[46]
GNM: A general navigation model to drive any robot,
D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine, “GNM: A general navigation model to drive any robot,” in 2023 IEEE International Conference on Robotics and Automation (ICRA) , IEEE, 2023, pp. 7226–7233
2023
-
[47]
ViNT: A Foundation Model for Visual Navigation,
D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine, “ViNT: A Foundation Model for Visual Navigation,” in 7th Annual Conference on Robot Learning (CoRL) , 2023
2023
-
[48]
Interactive language: Talking to robots in real time,
C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence, “Interactive language: Talking to robots in real time,” IEEE Robotics and Automation Letters , 2023
2023
-
[49]
Cliport: What and where pathways for robotic manipulation,
M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in Conference on Robot Learn- ing, PMLR, 2022, pp. 894–906
2022
-
[50]
Open-world object manipulation using pre-trained vision-language models,
A. Stone, T. Xiao, Y . Lu, K. Gopalakrishnan, K. -H. Lee, Q. Vuong, P. Wohlhart, S. Kirmani, B. Zitkovich, F. Xia, et al., “Open-world object manipulation using pre-trained vision-language models,” in Conference on Robot Learning , PMLR, 2023, pp. 3397–3417
2023
-
[51]
Perceiver-actor: A multi-task transformer for robotic manipulation,
M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” in Proceedings of the 6th Conference on Robot Learning (CoRL) , 2022
2022
-
[52]
A generalist agent,
S. Reed et al. , “A generalist agent,” Transactions on Machine Learning Research, 2022, ISSN : 2835-8856
2022
-
[53]
Real-world robot learning with masked visual pre-training,
I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in Conference on Robot Learning , 2022
2022
-
[54]
Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,
H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V . Kumar, “Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2024, pp. 4788–4795
2024
-
[55]
PaLI-X: On Scaling up a Multilingual Vision and Language Model
X. Chen et al. , Pali-x: On scaling up a multilingual vision and language model, 2023. arXiv: 2305.18565 [cs.CV]
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Palm-e: An embodied multimodal language model,
D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., “Palm-e: An embodied multimodal language model,” in International Conference on Machine Learning , PMLR, 2023, pp. 8469–8488
2023
-
[57]
Rovi- aug: Robot and viewpoint augmentation for cross-embodiment robot learning,
L. Y . Chen, C. Xu, K. Dharmarajan, M. Z. Irshad, R. Cheng, K. Keutzer, M. Tomizuka, Q. Vuong, and K. Goldberg, “Rovi- aug: Robot and viewpoint augmentation for cross-embodiment robot learning,” in Conference on Robot Learning (CoRL) , Munich, Germany, 2024
2024
-
[58]
Mirage: Cross-embodiment zero-shot policy transfer with cross-painting,
L. Y . Chen, K. Hari, K. Dharmarajan, C. Xu, Q. Vuong, and K. Goldberg, “Mirage: Cross-embodiment zero-shot policy transfer with cross-painting,” in Proceedings of Robotics: Science and Systems , Delft, Netherlands, 2024
2024
-
[59]
Shadow: Leveraging segmentation masks for zero-shot cross-embodiment policy transfer,
M. Lepert, R. Doshi, and J. Bohg, “Shadow: Leveraging segmentation masks for zero-shot cross-embodiment policy transfer,” inConference on Robot Learning (CoRL) , Munich, Germany, 2024
2024
-
[60]
Phantom: Training Robots Without Robots Using Only Human Videos
M. Lepert, J. Fang, and J. Bohg, Phantom: Training robots without robots using only human videos , 2025. arXiv: 2503 . 00779 [cs.RO]. [Online]. Available: https://arxiv.org/abs/ 2503.00779
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[61]
EgoMimic: Scaling imitation learning via egocentric video.arXiv preprint arXiv:2410.24221, 2024
S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu, Egomimic: Scaling imitation learning via egocentric video, 2024. arXiv: 2410.24221 [cs.RO] . [Online]. Available: https://arxiv.org/abs/2410.24221
-
[62]
Masquerade: Learning from In-the-wild Human Videos using Data-Editing
M. Lepert, J. Fang, and J. Bohg, Masquerade: Learning from in-the- wild human videos using data-editing , 2025. arXiv: 2508.09976 [cs.RO]. [Online]. Available: https://arxiv.org/abs/ 2508.09976
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
arXiv preprint arXiv:2403.12943 , year=
V . Jain, M. Attarian, N. J. Joshi, A. Wahid, D. Driess, Q. Vuong, P. R. Sanketi, P. Sermanet, S. Welker, C. Chan, et al., “Vid2robot: End-to-end video-conditioned policy learning with cross-attention transformers,” arXiv preprint arXiv:2403.12943 , 2024
- [64]
-
[65]
Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation,
R. Doshi, H. Walke, O. Mees, S. Dasari, and S. Levine, “Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation,” arXiv preprint arXiv:2408.11812 , 2024
-
[66]
Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning,
Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu, “Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning,” in 2025 IEEE International Conference on Robotics and Automation (ICRA) , 2025
2025
-
[67]
Anybimanual: Transferring unimanual policy for general bimanual manipulation,
G. Lu, T. Yu, H. Deng, S. S. Chen, Y . Tang, and Z. Wang, “Anybimanual: Transferring unimanual policy for general bimanual manipulation,” arXiv preprint arXiv:2412.06779 , 2024
-
[68]
Lfdt: Learning dual-arm manipulation from demonstration translated from a human and robotic arm,
M. Kobayashi, J. Yamada, M. Hamaya, and K. Tanaka, “Lfdt: Learning dual-arm manipulation from demonstration translated from a human and robotic arm,” in 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids) , 2023, pp. 1–8. DOI: 10.1109/Humanoids57100.2023.10375192
-
[69]
Unpaired image- to-image translation using cycle-consistent adversarial networks,
J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image- to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232
2017
-
[70]
Polybot: Training one policy across robots while embracing variability,
J. H. Yang, D. Sadigh, and C. Finn, “Polybot: Training one policy across robots while embracing variability,” in Conference on Robot Learning, PMLR, 2023, pp. 2955–2974
2023
-
[71]
Pushing the limits of cross-embodiment learning for manipulation and navigation,
J. Yang, C. Glossop, A. Bhorkar, D. Shah, Q. Vuong, C. Finn, D. Sadigh, and S. Levine, “Pushing the limits of cross-embodiment learning for manipulation and navigation,” 2024
2024
-
[72]
Reconstructing hands in 3D with transformers,
G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik, “Reconstructing hands in 3D with transformers,” in CVPR, 2024
2024
-
[73]
Method for registration of 3-d shapes,
P. J. Besl and N. D. McKay, “Method for registration of 3-d shapes,” in Sensor fusion IV: control paradigms and data structures , Spie, vol. 1611, 1992, pp. 586–606
1992
-
[74]
Object modelling by registration of multiple range images,
Y . Chen and G. Medioni, “Object modelling by registration of multiple range images,” Image and vision computing , vol. 10, no. 3, pp. 145–155, 1992
1992
-
[75]
Embodied hands: Modeling and capturing hands and bodies together,
J. Romero, D. Tzionas, and M. J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Transactions on Graphics, (Proc. SIGGRAPH Asia) , 245:1–245:17, vol. 36, no. 6, Nov. 2017
2017
-
[76]
SAM 2: Segment Anything in Images and Videos
N. Ravi et al., “Sam 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024. [Online]. Available: https: //arxiv.org/abs/2408.00714
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[77]
Towards an end-to-end framework for flow-guided video inpainting,
Z. Li, C. -Z. Lu, J. Qin, C. -L. Guo, and M. -M. Cheng, “Towards an end-to-end framework for flow-guided video inpainting,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2022
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.