LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation
Pith reviewed 2026-06-26 07:56 UTC · model grok-4.3
The pith
LaST-HD aligns human-hand and robot data in a shared latent forward-dynamics space so robots internalize physical dynamics from scalable human demonstrations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets in a shared forward-dynamics space. After cross-embodiment representations are aligned in this space, the targets supervise the latent reasoning process, enabling the model to internalize shared physical dynamics and support efficient human-hand action learning. The method further includes the OOL Glove for precise keypoint capture and a progressive mixed-to-human training schedule that improves generalization to novel objects and scenes while reaching over 90 percent accuracy with only 20 minutes of additional human data.
What carries the argument
Auxiliary action-conditioned world model that generates unified latent targets from unpaired trajectories to align cross-embodiment representations in a shared forward-dynamics space.
If this is right
- Generalization to novel objects, scenes, and positions improves when training uses only human-hand demonstrations after the latent alignment.
- Online correction with 20 minutes of OOL glove data enables over 90 percent accuracy in previously unseen environments.
- The progressive recipe of mixed human-robot co-training followed by human-hand correction reduces reliance on large robot-specific datasets.
- Latent supervision replaces direct kinematic mimicking as the mechanism for cross-embodiment transfer.
Where Pith is reading between the lines
- The same auxiliary world-model alignment could be tested on additional robotic morphologies such as arms or mobile bases if the forward-dynamics space remains embodiment-agnostic.
- If the latent targets prove robust, the method implies that abundant internet-scale human video could substitute for paired robot demonstrations in many manipulation tasks.
- The approach opens the possibility of measuring physical-dynamics similarity between any two embodiments by comparing their latent trajectories under the same world model.
Load-bearing premise
The auxiliary world model trained on unpaired trajectories produces latent targets that faithfully represent the shared physical dynamics between human hands and robots.
What would settle it
Training LaST-HD with the auxiliary model's latent targets yields no measurable gain in generalization or task accuracy over a baseline that uses only kinematic retargeting or separate embodiment-specific models.
Figures
read the original abstract
Human-hand demonstrations provide a direct and scalable source of physical interaction data for robot learning. While manual retargeting is indispensable for establishing kinematic action correspondence across different morphologies, robust transfer requires going beyond geometry to address the underlying alignment of physical dynamics between human and robot manipulation. To address this, we introduce LaST-HD, a novel human-to-robot action learning paradigm that extends reasoning-before-acting VLA by aligning human-hand and robot demonstrations in a shared latent reasoning space. Rather than mimicking human kinematics, LaST-HD trains an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to synthesize unified latent targets. After aligning cross-embodiment representations in this shared forward-dynamics space, these targets supervise LaST-HD's latent reasoning process, enabling it to internalize shared physical dynamics and drive efficient human-hand action learning. Moreover, we develop Out-of-Lab (OOL) Glove, a low-cost motion-capture glove tailored to LaST-HD for human-hand data collection. The captured human data provide precise keypoints and serve as universal action supervision across grippers and dexterous hands. Armed with the aligned latent space and high-fidelity human-hand data, we develop a progressive mixed-to-human training recipe comprising mixed human-robot co-training and human-hand online correction post-training. Through mixed co-training, LaST-HD improves generalization to novel objects, scenes, and positions using only human-hand demonstrations. With online correction, LaST-HD further adapts to novel environments and achieves over 90\% accuracy using only 20 minutes of OOL glove data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LaST-HD, a human-to-robot action learning method that extends reasoning-before-acting VLAs by training an auxiliary action-conditioned world model on unpaired human-hand and robot trajectories to produce unified latent targets in a shared forward-dynamics space. These targets align cross-embodiment representations and supervise the model's latent reasoning to internalize shared physical dynamics. The work also presents the OOL Glove for scalable human data collection and a progressive mixed-to-human training recipe, claiming over 90% accuracy on novel objects/scenes with only 20 minutes of human data.
Significance. If the auxiliary world model reliably produces latent targets that capture shared physical dynamics rather than embodiment-specific features, the approach could reduce reliance on robot-specific demonstrations and improve sample efficiency in manipulation learning. The OOL Glove and mixed-training recipe are practical contributions for data collection. The significance is currently limited by the absence of verification for the shared-dynamics assumption and experimental details.
major comments (2)
- [Abstract] Abstract: The central claim rests on the auxiliary action-conditioned world model 'synthesiz[ing] unified latent targets' that capture shared physical dynamics across embodiments from unpaired trajectories. No description is given of the world-model objective (e.g., presence or absence of a cross-embodiment consistency term), architecture, or any post-training check that human and robot latents for equivalent physical interactions (identical object contact) lie close together. This assumption is load-bearing; if the model learns separate forward predictors, the 'shared forward-dynamics space' does not exist and the supervision signal reduces to embodiment-specific imitation.
- [Abstract] Abstract: The performance claim ('over 90% accuracy using only 20 minutes of OOL glove data') and the 'progressive mixed-to-human training recipe' are stated without any experimental protocol, task definitions, baselines, number of trials, error bars, or ablation results. This prevents assessment of whether the latent-alignment mechanism actually drives the reported gains.
minor comments (1)
- [Abstract] Abstract: 'VLA' is introduced without expansion.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify areas where the abstract could better convey key technical details and experimental rigor. We respond to each major comment below and indicate planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim rests on the auxiliary action-conditioned world model 'synthesiz[ing] unified latent targets' that capture shared physical dynamics across embodiments from unpaired trajectories. No description is given of the world-model objective (e.g., presence or absence of a cross-embodiment consistency term), architecture, or any post-training check that human and robot latents for equivalent physical interactions (identical object contact) lie close together. This assumption is load-bearing; if the model learns separate forward predictors, the 'shared forward-dynamics space' does not exist and the supervision signal reduces to embodiment-specific imitation.
Authors: We agree the abstract is too concise on these points. The full manuscript (Section 3.2) specifies that the auxiliary world model is a transformer-based action-conditioned latent dynamics predictor trained solely with a next-state reconstruction loss on unpaired human-hand and robot trajectories; no explicit cross-embodiment consistency loss is used. Alignment arises from the shared action keypoint representation and joint optimization. The original submission did not include a post-training verification (e.g., latent distance metrics or visualizations for matched physical contacts). We will add this analysis in the revision to substantiate the shared-dynamics claim. revision: yes
-
Referee: [Abstract] Abstract: The performance claim ('over 90% accuracy using only 20 minutes of OOL glove data') and the 'progressive mixed-to-human training recipe' are stated without any experimental protocol, task definitions, baselines, number of trials, error bars, or ablation results. This prevents assessment of whether the latent-alignment mechanism actually drives the reported gains.
Authors: The abstract summarizes results; the full experimental protocol, task definitions (novel objects/scenes/positions), baselines, trial counts, error bars, and ablations appear in Section 4 and the supplementary material. We will revise the abstract to include a brief statement of the evaluation setup (e.g., number of trials and key metrics) so readers can better contextualize the claims without needing to consult the body immediately. revision: yes
Circularity Check
No circularity: method described at conceptual level without equations or self-referential derivations
full rationale
The provided abstract and description contain no equations, fitted parameters, or explicit derivation steps. The auxiliary action-conditioned world model is introduced as an assumption that produces unified latent targets from unpaired trajectories, but this is presented as a modeling choice rather than a result derived from or equivalent to its own outputs. No self-citations, uniqueness theorems, or renamings of known results are invoked in a load-bearing way within the given text. The central claim therefore remains a high-level architectural proposal without internal reduction to its inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Karamcheti, S
S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InForty-first Interna- tional Conference on Machine Learning, 2024
2024
-
[2]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...
Pith/arXiv arXiv 2025
-
[3]
L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdul- mohsin, M. Tschannen, E. Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024
Pith/arXiv arXiv 2024
-
[4]
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024
Pith/arXiv arXiv 2024
-
[5]
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. pi0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024
Pith/arXiv arXiv 2024
-
[6]
J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu, et al. Hy- bridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025
Pith/arXiv arXiv 2025
-
[7]
H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y . Guo, C.-W. Fu, S. Zhang, et al. Fast- in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025
arXiv 2025
-
[8]
Khazatsky, K
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. 2024
2024
-
[9]
K. Wu, C. Hou, J. Liu, Z. Che, X. Ju, et al. Robomind: Benchmark on multi-embodiment intelligence normative data for robot manipulation. InRobotics: Science and Systems (RSS)
-
[10]
Robotics: Science and Systems Foundation, 2025
2025
-
[11]
Open X-Embodiment Collaboration, A. Padalkar, A. Pooley, et al. Open X-Embodiment: Robotic learning datasets and RT-X models.https://arxiv.org/abs/2310.08864, 2023
Pith/arXiv arXiv 2023
-
[12]
A. S. Chen, S. Nair, and C. Finn. Learning generalizable robotic reward functions from” in- the-wild” human videos.RSS, 2021
2021
-
[13]
S. Bahl, A. Gupta, and D. Pathak. Human-to-robot imitation in the wild.arXiv preprint arXiv:2207.09450, 2022. 9
arXiv 2022
-
[14]
R.-Z. Qiu, S. Yang, X. Cheng, C. Chawla, J. Li, T. He, G. Yan, D. J. Yoon, R. Hoque, L. Paulsen, et al. Humanoid policy˜ human policy.arXiv preprint arXiv:2503.13441, 2025
arXiv 2025
-
[15]
Guzey, Y
I. Guzey, Y . Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351. IEEE, 2025
2025
-
[16]
M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos.arXiv preprint arXiv:2503.00779, 2025
Pith/arXiv arXiv 2025
-
[17]
Kareer, D
S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Con- ference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025
2025
-
[18]
S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3m: A universal visual represen- tation for robot manipulation. InCoRL, 2022
2022
-
[19]
Bharadhwaj, R
H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024
2024
-
[20]
G. A. Team. Gen-1: Scaling embodied foundation models to mastery.Generalist AI Blog,
-
[21]
https://generalistai.com/blog/apr-02-2026-GEN-1
2026
-
[22]
P. Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black, G. Bokinsky, S. Cao, T. Charbonnier, et al.π 0.7: A steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026
Pith/arXiv arXiv 2026
-
[23]
Kareer, K
S. Kareer, K. Pertsch, J. Darpinian, J. Hoffman, D. Xu, S. Levine, C. Finn, and S. Nair. Emer- gence of human to robot transfer in vision-language-action models
-
[24]
From human skill to robotic mastery.https://cypypccpy.github.io/ tech-blog.github.io/, Mar
PsiBot Team. From human skill to robotic mastery.https://cypypccpy.github.io/ tech-blog.github.io/, Mar. 2026. Accessed: 2026-05-20
2026
-
[25]
H. Bi, L. Wu, T. Lin, H. Tan, Z. Su, H. Su, and J. Zhu. H-rdt: Human manipulation en- hanced bimanual robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026
2026
- [26]
-
[27]
Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu, et al. Last {0}: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026
arXiv 2026
-
[28]
Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025
Pith/arXiv arXiv 2025
-
[29]
L. McInnes, J. Healy, and J. Melville. Umap: Uniform manifold approximation and projection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018
Pith/arXiv arXiv 2018
-
[30]
Zitkovich, T
B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In7th Annual Conference on Robot Learning, 2023
2023
-
[31]
M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025. 10
Pith/arXiv arXiv 2025
-
[32]
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025
Pith/arXiv arXiv 2025
-
[33]
Y . Su, N. Liu, D. Chen, Z. Zhao, K. Wu, M. Li, Z. Xu, Z. Che, and J. Tang. Freqpolicy: Efficient flow-based visuomotor policy via frequency consistency. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[34]
J. Wen, M. Zhu, Y . Zhu, Z. Tang, J. Li, Z. Zhou, C. Li, X. Liu, Y . Peng, C. Shen, et al. Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression.arXiv preprint arXiv:2412.03293, 2024
arXiv 2024
-
[35]
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, et al.π 0.5: a vision-language-action model with open-world generalization, 2025. URLhttps://arxiv.org/abs/2504.16054
Pith/arXiv arXiv 2025
-
[36]
F. Lin, R. Nai, Y . Hu, J. You, J. Zhao, and Y . Gao. Onetwovla: A unified vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2505.11917, 2025
arXiv 2025
-
[37]
M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning, 2025. URLhttps://arxiv.org/abs/2407.08693
Pith/arXiv arXiv 2025
-
[38]
Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al. Cot- vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025
2025
-
[39]
S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025
arXiv 2025
-
[40]
W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025
Pith/arXiv arXiv 2025
-
[41]
Z. Liu, J. Liu, J. Xu, N. Han, C. Gu, H. Chen, K. Zhou, R. Zhang, K. C. Hsieh, K. Wu, et al. Mla: A multisensory language-action model for multimodal understanding and forecasting in robotic manipulation.arXiv preprint arXiv:2509.26642, 2025
arXiv 2025
-
[42]
J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025
Pith/arXiv arXiv 2025
-
[43]
C. Gu, J. Liu, H. Chen, R. Huang, Q. Wuwu, Z. Liu, X. Li, Y . Li, R. Zhang, P. Jia, et al. Man- ualvla: A unified vla model for chain-of-thought manual generation and robotic manipulation. arXiv preprint arXiv:2512.02013, 2025
arXiv 2025
-
[44]
C.-P. Huang, Y .-H. Wu, M.-H. Chen, Y .-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025
Pith/arXiv arXiv 2025
-
[45]
J. Cai, Z. Cai, J. Cao, Y . Chen, Z. He, L. Jiang, H. Li, H. Li, Y . Li, Y . Liu, et al. Internvla- a1: Unifying understanding, generation and action for robotic manipulation.arXiv preprint arXiv:2601.02456, 2026
arXiv 2026
-
[46]
J. Lyu, K. Liu, X. Zhang, H. Liao, Y . Feng, W. Zhu, T. Shen, J. Chen, J. Zhang, Y . Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026
Pith/arXiv arXiv 2026
-
[47]
Banerjee, S
P. Banerjee, S. Shkodrani, P. Moulon, S. Hampali, S. Han, F. Zhang, L. Zhang, J. Fountain, E. Miller, S. Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi- view videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025. 11
2025
-
[48]
Grauman, A
K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024
2024
-
[49]
Grauman, A
K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022
2022
-
[50]
Damen, H
D. Damen, H. Doughty, G. Maria Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018
2018
-
[51]
Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. InThe Eleventh International Conference on Learning Representations, 2022
2022
-
[52]
Majumdar, K
A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? 2023
2023
-
[53]
C. Wang, L. Fan, J. Sun, R. Zhang, L. Fei-Fei, D. Xu, Y . Zhu, and A. Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play.arXiv preprint arXiv:2302.12422, 2023
arXiv 2023
-
[54]
S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InCVPR, 2023
2023
-
[55]
S. Ye, J. Jang, B. Jeon, S. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, et al. Latent action pretraining from videos.arXiv preprint arXiv:2410.11758, 2024
Pith/arXiv arXiv 2024
-
[56]
S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dreamdojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026
Pith/arXiv arXiv 2026
-
[57]
J. Duan, Y . R. Wang, M. Shridhar, D. Fox, and R. Krishna. Ar2-d2: Training a robot without a robot.arXiv preprint arXiv:2306.13818, 2023
arXiv 2023
-
[58]
Y . Park, J. S. Bhatia, L. Ankile, and P. Agrawal. Dexhub and dart: Towards internet scale robot data collection.arXiv preprint arXiv:2411.02214, 2024
arXiv 2024
-
[59]
T. Tao, M. K. Srirama, J. J. Liu, K. Shaw, and D. Pathak. Dexwild: Dexterous human interac- tions for in-the-wild robot policies.arXiv preprint arXiv:2505.07813, 2025
Pith/arXiv arXiv 2025
-
[60]
R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, et al. Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025
Pith/arXiv arXiv 2025
-
[61]
X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan. Janus-pro: Uni- fied multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
Pith/arXiv arXiv 2025
-
[62]
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training. InInternational Conference on Computer Vision (ICCV), 2023
2023
-
[63]
G. Luo, L. Dunlap, D. H. Park, A. Holynski, and T. Darrell. Diffusion hyperfeatures: Search- ing through time and space for semantic correspondence.Advances in Neural Information Processing Systems, 36:47500–47510, 2023. 12
2023
-
[64]
L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan. Emergent correspondence from image diffusion.Advances in neural information processing systems, 36:1363–1389, 2023
2023
-
[65]
X. Xu, Y . Hou, Z. Liu, and S. Song. Compliant residual dagger: Improving real-world contact- rich manipulation with human corrections.Advances in Neural Information Processing Sys- tems, 38:139559–139581, 2026
2026
-
[66]
M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026
Pith/arXiv arXiv 2026
-
[67]
Zhang, J
J. Zhang, J. Deng, C. Ma, and R. A. Potamias. Hawor: World-space hand motion reconstruc- tion from egocentric videos. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1805–1815, 2025
2025
-
[68]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
Pith/arXiv arXiv 2025
-
[69]
O’Neill, A
A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
2024
-
[70]
Ebert, Y
F. Ebert, Y . Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. InRSS, 2022
2022
-
[71]
Walke, K
H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine. Bridgedata v2: A dataset for robot learning at scale, 2023
2023
-
[72]
Z. J. Cui, Y . Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data.arXiv preprint arXiv:2210.10047, 2022
arXiv 2022
-
[73]
D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation.arXiv preprint arXiv:1806.10293, 2018
Pith/arXiv arXiv 2018
-
[74]
Belkhale, Y
S. Belkhale, Y . Cui, and D. Sadigh. Hydra: Hybrid robot actions for imitation learning.arxiv, 2023
2023
-
[75]
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, et al. Rt-1: Robotics transformer for real-world control at scale. InarXiv preprint arXiv:2212.06817, 2022
Pith/arXiv arXiv 2022
-
[76]
Dasari, F
S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning. InConference on Robot Learning, pages 885–897. PMLR, 2020
2020
-
[77]
S. Dass, J. Yapeter, J. Zhang, J. Zhang, K. Pertsch, S. Nikolaidis, and J. J. Lim. CLVR jaco play dataset, 2023. URLhttps://github.com/clvrai/clvr_jaco_play_dataset
2023
-
[78]
Lynch, A
C. Lynch, A. Wahid, J. Tompson, T. Ding, J. Betker, R. Baruch, T. Armstrong, and P. Florence. Interactive language: Talking to robots in real time.IEEE Robotics and Automation Letters, 2023
2023
-
[79]
N. M. M. Shafiullah, A. Rai, H. Etukuru, Y . Liu, I. Misra, S. Chintala, and L. Pinto. On bringing robots home, 2023. 13
2023
-
[80]
E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. InConference on Robot Learn- ing, pages 991–1002. PMLR, 2022
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.