Recognition: 2 theorem links
· Lean TheoremRobo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
Pith reviewed 2026-05-16 02:35 UTC · model grok-4.3
The pith
Robo3R predicts accurate metric-scale 3D geometry directly from RGB images and robot states for real-time robotic use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Robo3R jointly infers scale-invariant local geometry and relative camera poses from RGB images and robot states, unifies them into the scene representation in the canonical robot frame via a learned global similarity transformation, employs a masked point head for fine-grained point clouds and a keypoint-based PnP formulation to refine extrinsics, and when trained on the Robo3R-4M synthetic dataset outperforms existing reconstruction methods and depth sensors on accuracy and on tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.
What carries the argument
The masked point head for sharp point clouds combined with keypoint-based PnP refinement, unified by a learned global similarity transformation into the robot frame.
Load-bearing premise
The high-fidelity synthetic Robo3R-4M dataset supplies training data that generalizes to real-world robotic environments and physical interactions without large domain gaps.
What would settle it
A side-by-side real-robot grasping experiment in which depth-sensor input yields higher success rates than Robo3R reconstructions on the same hardware and scenes.
Figures
read the original abstract
3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Robo3R, a feed-forward 3D reconstruction model for robotic manipulation that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. It jointly infers scale-invariant local geometry and relative camera poses, unifies them into a canonical robot frame via a learned global similarity transformation, employs a masked point head for fine-grained point clouds and a keypoint-based PnP formulation for extrinsic refinement, and is trained on the Robo3R-4M synthetic dataset of four million high-fidelity frames. The central claim is that Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors, yielding performance gains across downstream robotic tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.
Significance. If the empirical claims are substantiated, Robo3R would represent a meaningful advance in robotic 3D perception by supplying a real-time, feed-forward alternative to noisy depth sensors that maintains metric consistency and integrates robot state for canonical alignment. This could improve precision in physical manipulation tasks and reduce reliance on hardware-specific sensing, particularly if the synthetic-to-real transfer holds under realistic conditions.
major comments (2)
- [Abstract] Abstract: the assertion that Robo3R 'consistently outperforms state-of-the-art reconstruction methods and depth sensors' and delivers 'consistent gains in performance' across downstream tasks is presented without any quantitative metrics, error bars, ablation studies, or evaluation-protocol details, leaving the central empirical claims unsupported in the visible text.
- [Evaluation] The central generalization claim (synthetic training on Robo3R-4M yielding metric-accurate geometry for physical robot tasks) lacks supporting evidence such as domain-randomization ablations, real-sensor noise modeling, or quantitative real-world tables reporting Chamfer distances and pose errors against depth-sensor baselines; without these, the precision advantage may not survive material reflectance, calibration drift, or lighting variations.
minor comments (1)
- [Method] Clarify the exact formulation of the learned global similarity transformation and its interaction with the PnP refinement step to avoid ambiguity in how metric scale is recovered.
Simulated Author's Rebuttal
Thank you for your review. We appreciate the opportunity to clarify and strengthen the empirical support in our paper. We respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that Robo3R 'consistently outperforms state-of-the-art reconstruction methods and depth sensors' and delivers 'consistent gains in performance' across downstream tasks is presented without any quantitative metrics, error bars, ablation studies, or evaluation-protocol details, leaving the central empirical claims unsupported in the visible text.
Authors: We agree that the abstract should provide more concrete support for the claims. The detailed quantitative metrics, error bars, ablation studies, and evaluation protocols are presented in Section 4 (Evaluation) and the appendix. In the revised manuscript, we have updated the abstract to include specific quantitative results, such as the reported improvements in 3D reconstruction accuracy and downstream task performance, to make the claims more substantiated in the visible text. revision: yes
-
Referee: [Evaluation] The central generalization claim (synthetic training on Robo3R-4M yielding metric-accurate geometry for physical robot tasks) lacks supporting evidence such as domain-randomization ablations, real-sensor noise modeling, or quantitative real-world tables reporting Chamfer distances and pose errors against depth-sensor baselines; without these, the precision advantage may not survive material reflectance, calibration drift, or lighting variations.
Authors: The manuscript does include quantitative evaluations on physical robots for sim-to-real transfer, grasp synthesis, and motion planning, showing gains over depth sensors. However, we acknowledge the value of additional ablations. We have added domain-randomization ablations and real-sensor noise modeling in the revised version. Additionally, we include a new table with real-world Chamfer distances and pose errors compared to depth-sensor baselines to demonstrate robustness to material reflectance, calibration drift, and lighting variations. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents an empirical feed-forward reconstruction model trained on the external Robo3R-4M synthetic dataset and evaluated on separate downstream robotic tasks (imitation learning, grasp synthesis, motion planning). No load-bearing step reduces by construction to self-definition, fitted inputs renamed as predictions, or self-citation chains; the architecture (masked point head, learned similarity transform, PnP refinement) is motivated by task requirements rather than tautologically derived from the reported results. Performance gains are shown via comparisons to baselines on held-out evaluations.
Axiom & Free-Parameter Ledger
free parameters (1)
- global similarity transformation parameters
axioms (1)
- domain assumption RGB images combined with robot states contain sufficient information to reconstruct accurate metric-scale 3D geometry
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation... keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
Reference graph
Works this paper leans on
-
[1]
In Robotics: Science and Systems (RSS), 2024
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. In Robotics: Science and Systems (RSS), 2024
work page 2024
-
[2]
Depth pro: Sharp monocular metric depth in less than a second
Aleksei Bochkovskii, Ama ˜AG ¸ l Delaunoy, Hugo Ger- main, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. InInternational Conference on Learning Representations (ICLR), 2025
work page 2025
-
[3]
Neu- ral mp: A generalist neural motion planner
Murtaza Dalal, Jiahui Yang, Russell Mendonca, Youssef Khaky, Ruslan Salakhutdinov, and Deepak Pathak. Neu- ral mp: A generalist neural motion planner. InInter- national Conference on Intelligent Robots and Systems (IROS), 2025
work page 2025
-
[4]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Input Imagesw/o MPHOurs 1.5 mm-diameter ropes Fig. 8:Reconstruction w/ and w/o the masked point head. Proceedings of the IEEE/CVF Conference on Computer ...
work page 2023
-
[5]
Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[6]
Digital twin cata- log: A large-scale photorealistic 3d object digital twin dataset
Zhao Dong, Ka Chen, Zhaoyang Lv, Hong-Xing Yu, Yunzhi Zhang, Cheng Zhang, Yufeng Zhu, Stephen Tian, Zhengqin Li, Geordie Moffatt, et al. Digital twin cata- log: A large-scale photorealistic 3d object digital twin dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[7]
Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023
work page 2023
-
[8]
St4rtrack: Simultaneous 4d recon- struction and tracking in the world
Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d recon- struction and tracking in the world. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
work page 2025
-
[9]
Act3d: Infinite resolution action detection transformer for robotic manipulation
Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: Infinite resolution action detection transformer for robotic manipulation. InCon- ference on Robot Learning (CoRL), 2023
work page 2023
-
[10]
Rvt: Robotic view transformer for 3d object manipulation
Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. InConference on Robot Learning (CoRL), 2023
work page 2023
-
[11]
Rvt-2: Learning precise manip- ulation from few demonstrations
Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manip- ulation from few demonstrations. InRobotics: Science and Systems (RSS), 2024
work page 2024
-
[12]
3d diffuser actor: Policy diffusion with 3d scene representations
Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[13]
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Pyroki: A modular toolkit for robot kinematic optimization
Chung Min Kim, Brent Yi, Hongsuk Choi, Yi Ma, Ken Goldberg, and Angjoo Kanazawa. Pyroki: A modular toolkit for robot kinematic optimization. in 2025 ieee. InInternational Conference on Intelligent Robots and Systems (IROS), 2025
work page 2025
-
[15]
Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Ep n p: An accurate o (n) solution to the p n p problem.International journal of computer vision, 81 (2):155–166, 2009
work page 2009
-
[16]
Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation align- ment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025
-
[17]
Gendexgrasp: Generalizable dexterous grasping
Puhao Li, Tengyu Liu, Yuyang Li, Yiran Geng, Yixin Zhu, Yaodong Yang, and Siyuan Huang. Gendexgrasp: Generalizable dexterous grasping. InInternational Con- ference on Robotics and Automation (ICRA), 2023
work page 2023
-
[18]
Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision- language-action model with implicit spatial understand- ing.arXiv preprint arXiv:2507.00416, 2025
-
[20]
Minghuan Liu, Zhengbang Zhu, Xiaoshen Han, Peng Hu, Haotong Lin, Xinyao Li, Jingxiao Chen, Jiafeng Xu, Yichu Yang, Yunfeng Lin, et al. Manipulation as in simulation: Enabling accurate geometry perception in robots.arXiv preprint arXiv:2509.02530, 2025
-
[21]
Xinhang Liu, Yuxi Xiao, Donny Y Chen, Jiashi Feng, Yu-Wing Tai, Chi-Keung Tang, and Bingyi Kang. Trace anything: Representing any video in 4d via trajectory fields.arXiv preprint arXiv:2510.13802, 2025
-
[22]
Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H 3dp: Triply-hierarchical diffusion policy for visuomotor learn- ing.arXiv preprint arXiv:2505.07819, 2025
-
[23]
Dextrah-g: Pixels- to-action dexterous arm-hand grasping with geometric fabrics
Tyler Ga Wei Lum, Martin Matak, Viktor Makoviy- chuk, Ankur Handa, Arthur Allshire, Tucker Hermans, Nathan D Ratliff, and Karl Van Wyk. Dextrah-g: Pixels- to-action dexterous arm-hand grasping with geometric fabrics. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[24]
Juncheng Mu, Sizhe Yang, Hojin Bae, Feiyu Jia, Qingwei Ben, Boyi Li, Huazhe Xu, and Jiangmiao Pang. One-policy-fits-all: Geometry-aware action la- tents for cross-embodiment manipulation.arXiv preprint arXiv:2603.14522, 2026
-
[25]
Quanhao Qian, Guoyang Zhao, Gongjie Zhang, Jiuniu Wang, Ran Xu, Junlong Gao, and Deli Zhao. Gp3: A 3d geometry-aware policy with multi-view images for robotic manipulation.arXiv preprint arXiv:2509.15733, 2025
-
[26]
Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation
Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, and Xiaolong Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation. InConference on Robot Learning (CoRL), 2023
work page 2023
-
[27]
Perceiver-actor: A multi-task transformer for robotic ma- nipulation
Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic ma- nipulation. InConference on Robot Learning (CoRL), 2023
work page 2023
-
[28]
Dynamic point maps: A versatile representation for dynamic 3d reconstruction
Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
work page 2025
-
[29]
Curobo: Parallelized collision-free robot motion generation
Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, et al. Curobo: Parallelized collision-free robot motion generation. InInternational Conference on Robotics and Automation (ICRA), 2023
work page 2023
-
[30]
Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026
Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, et al. Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026
-
[31]
Vggt: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[32]
Continuous 3d perception model with persistent state
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[33]
Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation
Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. InInternational Conference on Robotics and Automation (ICRA), 2023
work page 2023
-
[34]
Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xi- ang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[35]
Moge-2: Accurate monocular geometry with metric scale and sharp details
Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jian- feng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[36]
Dust3r: Geometric 3d vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[37]
Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He.pi 3: Permutation- equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Zhenyu Wei, Zhixuan Xu, Jingxiang Guo, Yiwen Hou, Chongkai Gao, Zhehao Cai, Jiayu Luo, and Lin Shao. D (r, o) grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. In International Conference on Robotics and Automation (ICRA), 2025
work page 2025
-
[39]
Foundationstereo: Zero-shot stereo matching
Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[40]
Chaineddif- fuser: Unifying trajectory diffusion and keypose predic- tion for robotic manipulation
Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. Chaineddif- fuser: Unifying trajectory diffusion and keypose predic- tion for robotic manipulation. InConference on Robot Learning (CoRL), 2023
work page 2023
-
[41]
Pixel-perfect depth with semantics-prompted diffusion transformers
Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers. InAdvances in Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[42]
Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[43]
Dnact: Diffusion guided multi-task 3d policy learning
Ge Yan, Yueh-Hua Wu, and Xiaolong Wang. Dnact: Diffusion guided multi-task 3d policy learning. InInter- national Conference on Intelligent Robots and Systems (IROS), 2025
work page 2025
-
[44]
Maniflow: A general robot manipulation policy via consistency flow training
Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, et al. Maniflow: A general robot manipulation policy via consistency flow training. InConference on Robot Learning (CoRL), 2025
work page 2025
-
[45]
Deep reactive policy: Learning reactive manipulator motion planning for dynamic environments
Jiahui Yang, Jason Jingzhou Liu, Yulong Li, Youssef Khaky, Kenneth Shaw, and Deepak Pathak. Deep reactive policy: Learning reactive manipulator motion planning for dynamic environments. InConference on Robot Learning (CoRL), 2025
work page 2025
-
[46]
Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. InAdvances in Neural Information Process- ing Systems (NeurIPS), 2024
work page 2024
-
[47]
Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Ultradex- grasp: Learning universal dexterous grasping for bi- manual robots with synthetic data.arXiv preprint arXiv:2603.05312, 2026
-
[48]
Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning
Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, and Huazhe Xu. Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[49]
Zhecheng Yuan, Tianming Wei, Langzhe Gu, Pu Hua, Tianhai Liang, Yuanpei Chen, and Huazhe Xu. Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation.arXiv preprint arXiv:2508.20085, 2025
-
[50]
Gnfactor: Multi-task real robot learning with generalizable neural feature fields
Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on robot learning (CoRL), 2023
work page 2023
-
[51]
3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations
Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. InRobotics: Science and Systems (RSS), 2024
work page 2024
-
[52]
Efficiently reconstructing dynamic scenes one d4rt at a time.arXiv preprint arXiv:2512.08924, 2025
Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Jo ¨elle K Barral, Raia Hadsell, et al. Efficiently reconstructing dynamic scenes one d4rt at a time.arXiv preprint arXiv:2512.08924, 2025
-
[53]
Generative visual foresight meets task-agnostic pose estimation in robotic table-top manipulation
Chuye Zhang, Xiaoxiong Zhang, Wei Pan, Linfang Zheng, and Wei Zhang. Generative visual foresight meets task-agnostic pose estimation in robotic table-top manipulation. InConference on robot learning (CoRL), 2025
work page 2025
-
[54]
Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes
Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. InConference on Robot Learning (CoRL), 2024
work page 2024
-
[55]
Monst3r: A simple approach for estimating geometry in the presence of motion
Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. In International Conference on Learning Representations (ICLR), 2025. APPENDIX In this Appendix, we provide details on synthetic data generation (Appendix A...
work page 2025
-
[56]
We model the sensors as pinhole cameras with randomized intrinsic param- eters
Camera Configuration:The simulation is equipped with a multi-camera system yielding RGB images. We model the sensors as pinhole cameras with randomized intrinsic param- eters. For each episode, we perturb the focal lengths (f x, fy) and principal points (c x, cy) of the camera intrinsic matrixK. Additionally, we randomize the focus distance and f-number t...
-
[57]
Lighting Configuration:The lighting system consists of three types of light sources: Dome, Sphere, and Distant lights. This setup allows us to effectively simulate a wide range of lighting conditions, as described below: •Dome Light (Environment Map):We utilize High Dynamic Range Images (HDRI) loaded from a predefined asset list. For each episode, a rando...
work page 2000
-
[58]
Scene Composition and Material Randomization:The scene consists of a robot manipulator, manipulable objects, and background elements such as a tabletop. •Robot:The robot’s joint configuration is initialized via an inverse kinematics (IK) solver to reach random valid end- effector poses. We apply material randomization to the robot’s visual mesh, adding ji...
-
[59]
Imitation Learning Details:We select four tasks, Sweep Bean, Insert Screw, Breakfast, and BiDex Pour, to validate the application of Robo3R in imitation learning. Sweep Bean, Insert Screw, and Breakfast are conducted on a single-arm platform, while BiDex Pour is performed on a bimanual robot platform. For the single-arm platform, we use two RealSense D455...
-
[60]
Sim-to-Real Transfer Details:We assess whether Robo3R is effective in narrowing the sim-to-real visual gap in two tasks: Push Cube and Pick Cube, as depicted in Fig. 10. The sim-to-real visual gaps produced by different methods are visualized in Fig. 11. Compared to RGB images and point clouds acquired by depth cameras, Robo3R achieves a substantially sma...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.