pith. machine review for the scientific record.

arxiv: 2602.10101 · v2 · submitted 2026-02-10 · 💻 cs.RO

Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction

Pith reviewed 2026-05-16 02:35 UTC · model grok-4.3

classification 💻 cs.RO
keywords 3D reconstruction · robotic manipulation · feed-forward model · point cloud · metric scale · grasp synthesis · motion planning · sim-to-real

The pith

Robo3R predicts accurate metric-scale 3D geometry directly from RGB images and robot states for real-time robotic use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Robo3R, a feed-forward model that reconstructs scene geometry at metric scale without relying on depth sensors. It takes RGB images plus robot states, predicts scale-invariant local geometry and relative camera poses, then aligns them into the robot's canonical frame through a learned similarity transform. A masked point head produces sharp point clouds, while a keypoint-based PnP step refines camera alignment. The model is trained on a four-million-frame synthetic dataset and tested on downstream manipulation tasks. If the approach holds, robots could achieve more reliable 3D perception using ordinary cameras across imitation learning, grasping, and planning.

Core claim

Robo3R jointly infers scale-invariant local geometry and relative camera poses from RGB images and robot states, unifies them into the scene representation in the canonical robot frame via a learned global similarity transformation, employs a masked point head for fine-grained point clouds and a keypoint-based PnP formulation to refine extrinsics, and when trained on the Robo3R-4M synthetic dataset outperforms existing reconstruction methods and depth sensors on accuracy and on tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.
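The alignment step in this claim can be pictured with a small sketch. It is an illustrative assumption, not the paper's implementation: the global similarity transformation is treated as a scale `s`, a rotation `R`, and a translation `t`, and each local point `p` is mapped to `s * R p + t` in the robot frame. The function names and all numeric values are hypothetical.

```python
import math


def apply_similarity(points, scale, rotation, translation):
    """Map local points into the robot frame: p' = scale * (R @ p) + t."""
    mapped = []
    for p in points:
        # Rotate the point with a 3x3 rotation matrix given as nested lists.
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        # Apply the global scale, then shift by the translation.
        mapped.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return mapped


def rot_z(theta):
    """Rotation about the z-axis by theta radians (hypothetical helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Made-up local geometry in arbitrary (scale-invariant) units.
local_points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
robot_points = apply_similarity(
    local_points,
    scale=0.5,                    # assumed metric scale factor
    rotation=rot_z(math.pi / 2),  # assumed camera-to-robot rotation
    translation=(0.1, 0.0, 0.3),  # assumed camera-to-robot offset
)
```

Under this reading, the scale factor is what turns scale-invariant local predictions into metric coordinates, which is why errors in the learned transform would propagate directly into every downstream task.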

What carries the argument

The masked point head for sharp point clouds combined with keypoint-based PnP refinement, unified by a learned global similarity transformation into the robot frame.
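A rough sketch of the masked point head's output stage, following the Figure 3 caption (depth, normalized image coordinates, and a mask combined through unprojection and masking). The thresholding rule, the `unproject_masked` name, and the toy 2×2 inputs are assumptions for illustration; the paper's actual combination step may differ.

```python
def unproject_masked(depth, norm_xy, mask, threshold=0.5):
    """Unproject per-pixel predictions to 3D and keep only confident pixels.

    depth:   rows of per-pixel depth values
    norm_xy: rows of (x, y) normalized image coordinates (pixel rays at z = 1)
    mask:    rows of per-pixel keep scores in [0, 1] (assumed semantics)
    """
    points = []
    for d_row, xy_row, m_row in zip(depth, norm_xy, mask):
        for d, (x, y), m in zip(d_row, xy_row, m_row):
            if m > threshold:
                # Scale the normalized ray (x, y, 1) by depth to get the 3D point.
                points.append((x * d, y * d, d))
    return points


# Toy 2x2 prediction; the low-score pixel stands in for a smeared depth edge.
depth = [[2.0, 4.0], [1.0, 3.0]]
norm_xy = [[(-0.5, -0.5), (0.5, -0.5)], [(-0.5, 0.5), (0.5, 0.5)]]
mask = [[0.9, 0.1], [0.8, 0.7]]
points = unproject_masked(depth, norm_xy, mask)
```

Dropping low-score pixels instead of regressing a point for every pixel is one plausible way such a head could avoid the over-smoothed "flying points" that dense regression tends to produce at object boundaries.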

Load-bearing premise

The high-fidelity synthetic Robo3R-4M dataset supplies training data that generalizes to real-world robotic environments and physical interactions without large domain gaps.

What would settle it

A side-by-side real-robot grasping experiment in which depth-sensor input yields higher success rates than Robo3R reconstructions on the same hardware and scenes.

Figures

Figures reproduced from arXiv: 2602.10101 by Dahua Lin, Hao Li, Jiangmiao Pang, Jia Zeng, Juncheng Mu, Linning Xu, Sizhe Yang.

Figure 1: Overview. Robo3R enables manipulation-ready 3D reconstruction from RGB frames in real time. By achieving accurate metric-scale 3D geometry in the canonical robot frame, Robo3R eliminates the need for depth sensors and calibration, while improving accuracy and robustness in challenging manipulation scenarios. These features lead to notable improvements in downstream applications such as imitation learning, …
Figure 2: Method Overview. RGB images and robot states are encoded and fused. The transformer backbone processes the resulting features through alternating global and frame-wise attention. The masked point head decodes scale-invariant local geometry, while the relative pose head outputs relative poses for registering points across multiple views. S.T. tokens read out the global similarity transformation, which maps …
Figure 3: Masked point head. To address the over-smoothing problem for dense prediction, we propose a masked point head that decomposes point prediction into depth, normalized image coordinate, and mask predictions. Through unprojection, masking, and combination, we obtain sharp points with fine-grained geometric details.
Figure 4: Extrinsic estimation module. The extrinsic estimation module extracts robot keypoints and accurately estimates the camera extrinsics by solving the Perspective-n-Point (PnP) problem; the camera extrinsics are used to refine the global similarity transformation.
… (MLP) with GeLU activations. The image and state features are then fused via element-wise addition to obtain the combined features F ∈ ℝ^(N × H/14 × …
Figure 5: Data samples. The dataset showcases a diverse array of assets with extensive randomization, encompassing rich modalities and comprehensive annotations.
… camera intrinsics and extrinsics. Representative data samples are illustrated in …
Figure 6: Qualitative Comparisons of 3D geometry. We use a RealSense D455 as the depth camera and manually align the scale for point clouds reconstructed by π³. We evaluate the methods on several challenging scenarios, including a tiny object (row 1), a scene with a mirror and a transparent cup (row 2), and a cluttered environment (row 3).
… only 2 mm, requiring millimeter-level reconstruction and manipulation preci…
Figure 7: Imitation learning tasks. We design four manipulation tasks for imitation learning evaluation: Sweep Bean, Insert Screw, Breakfast, and BiDex Pour.
2) Sim-to-Real Transfer: Experimental setup and baselines. We collect 200 demonstrations for each of the “Push Cube” and “Pick Cube” tasks in NVIDIA Isaac Sim. The baseline methods acquire data using either RGB or depth cameras in simulation and deploy policie…
Figure 8: Reconstruction w/ and w/o the masked point head.
Figure 9: Hardware setup. Our experimental platform features both single-arm and dual-arm robots. We use RealSense D455 cameras to capture images of the scene.
… with binocular input, meeting the real-time requirements of robotic manipulation. D. Real-World Manipulation Experiment Details. The robotic platforms used in our experiments consist of a single-arm Franka Research 3 equipped with a parallel gripper and a bima…
Figure 11: Sim-to-real visual gap. The left column shows data obtained from simulation, while the right column presents data from the real world. The first row displays observations from the RGB camera, the second row shows point clouds from the depth camera, and the third row illustrates point clouds reconstructed by Robo3R from RGB images.
Figure 12: Reconstruction results in diverse scenarios.
… strong generalization capabilities to other robot types, such as a wheeled dual-arm robot (Row 1), as well as to indoor scenes without any robots present (Row 2).
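The Figure 4 caption describes estimating camera extrinsics by solving a PnP problem over robot keypoints. The sketch below does not solve PnP; it only evaluates the reprojection error that a PnP solver typically minimizes, which is enough to see why a wrong extrinsic estimate shows up as pixel error at the keypoints. The pinhole intrinsics, keypoint positions, and function names are hypothetical.

```python
import math


def project(point, rotation, translation, fx, fy, cx, cy):
    """Project a robot-frame 3D point to pixels with a pinhole camera model."""
    # Robot frame -> camera frame: p_cam = R @ p + t.
    pc = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
          for i in range(3)]
    # Perspective division followed by the intrinsics.
    return (fx * pc[0] / pc[2] + cx, fy * pc[1] / pc[2] + cy)


def mean_reprojection_error(points_3d, pixels_2d, rotation, translation, **intr):
    """Average pixel distance between projected 3D keypoints and 2D detections."""
    errors = []
    for p, (u, v) in zip(points_3d, pixels_2d):
        pu, pv = project(p, rotation, translation, **intr)
        errors.append(math.hypot(pu - u, pv - v))
    return sum(errors) / len(errors)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intr = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)  # assumed intrinsics
# Made-up robot keypoints (e.g. joint positions from forward kinematics).
keypoints_3d = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.1, 0.1, 0.1)]
true_t = (0.0, 0.0, 1.0)

# Synthetic 2D "detections" generated from the true extrinsics.
pixels = [project(p, identity, true_t, **intr) for p in keypoints_3d]

err_true = mean_reprojection_error(keypoints_3d, pixels, identity, true_t, **intr)
err_shifted = mean_reprojection_error(
    keypoints_3d, pixels, identity, (0.02, 0.0, 1.0), **intr
)
```

In this toy setup a 2 cm error in the camera translation produces a reprojection error of several pixels, which illustrates why keypoint-based refinement can be a sensitive signal for extrinsic alignment.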
Original abstract

3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Robo3R, a feed-forward 3D reconstruction model for robotic manipulation that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. It jointly infers scale-invariant local geometry and relative camera poses, unifies them into a canonical robot frame via a learned global similarity transformation, employs a masked point head for fine-grained point clouds and a keypoint-based PnP formulation for extrinsic refinement, and is trained on the Robo3R-4M synthetic dataset of four million high-fidelity frames. The central claim is that Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors, yielding performance gains across downstream robotic tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.

Significance. If the empirical claims are substantiated, Robo3R would represent a meaningful advance in robotic 3D perception by supplying a real-time, feed-forward alternative to noisy depth sensors that maintains metric consistency and integrates robot state for canonical alignment. This could improve precision in physical manipulation tasks and reduce reliance on hardware-specific sensing, particularly if the synthetic-to-real transfer holds under realistic conditions.

major comments (2)
  1. [Abstract] Abstract: the assertion that Robo3R 'consistently outperforms state-of-the-art reconstruction methods and depth sensors' and delivers 'consistent gains in performance' across downstream tasks is presented without any quantitative metrics, error bars, ablation studies, or evaluation-protocol details, leaving the central empirical claims unsupported in the visible text.
  2. [Evaluation] The central generalization claim (synthetic training on Robo3R-4M yielding metric-accurate geometry for physical robot tasks) lacks supporting evidence such as domain-randomization ablations, real-sensor noise modeling, or quantitative real-world tables reporting Chamfer distances and pose errors against depth-sensor baselines; without these, the precision advantage may not survive material reflectance, calibration drift, or lighting variations.
minor comments (1)
  1. [Method] Clarify the exact formulation of the learned global similarity transformation and its interaction with the PnP refinement step to avoid ambiguity in how metric scale is recovered.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your review. We appreciate the opportunity to clarify and strengthen the empirical support in our paper. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that Robo3R 'consistently outperforms state-of-the-art reconstruction methods and depth sensors' and delivers 'consistent gains in performance' across downstream tasks is presented without any quantitative metrics, error bars, ablation studies, or evaluation-protocol details, leaving the central empirical claims unsupported in the visible text.

    Authors: We agree that the abstract should provide more concrete support for the claims. The detailed quantitative metrics, error bars, ablation studies, and evaluation protocols are presented in Section 4 (Evaluation) and the appendix. In the revised manuscript, we have updated the abstract to include specific quantitative results, such as the reported improvements in 3D reconstruction accuracy and downstream task performance, to make the claims more substantiated in the visible text. revision: yes

  2. Referee: [Evaluation] The central generalization claim (synthetic training on Robo3R-4M yielding metric-accurate geometry for physical robot tasks) lacks supporting evidence such as domain-randomization ablations, real-sensor noise modeling, or quantitative real-world tables reporting Chamfer distances and pose errors against depth-sensor baselines; without these, the precision advantage may not survive material reflectance, calibration drift, or lighting variations.

    Authors: The manuscript does include quantitative evaluations on physical robots for sim-to-real transfer, grasp synthesis, and motion planning, showing gains over depth sensors. However, we acknowledge the value of additional ablations. We have added domain-randomization ablations and real-sensor noise modeling in the revised version. Additionally, we include a new table with real-world Chamfer distances and pose errors compared to depth-sensor baselines to demonstrate robustness to material reflectance, calibration drift, and lighting variations. revision: yes
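Both the referee comment and the response refer to Chamfer distance as the reconstruction metric. For readers unfamiliar with it, here is a minimal sketch of one common variant (symmetric, mean of unsquared nearest-neighbor distances). Conventions differ between papers, and which variant Robo3R reports is not stated in the text above; the point sets are made up.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (brute force, O(n*m))."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy example: the second reconstructed point is 0.1 units off the reference.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
cd = chamfer_distance(reconstructed, reference)
```

Because the metric averages nearest-neighbor distances, it rewards overall surface agreement but can hide a small number of large outliers, so it is usually read alongside pose errors and task success rates rather than on its own.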

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical feed-forward reconstruction model trained on the external Robo3R-4M synthetic dataset and evaluated on separate downstream robotic tasks (imitation learning, grasp synthesis, motion planning). No load-bearing step reduces by construction to self-definition, fitted inputs renamed as predictions, or self-citation chains; the architecture (masked point head, learned similarity transform, PnP refinement) is motivated by task requirements rather than tautologically derived from the reported results. Performance gains are shown via comparisons to baselines on held-out evaluations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that RGB images plus robot states suffice for metric 3D geometry and that synthetic data generalizes to real manipulation; the learned global similarity transformation is a fitted component without independent verification.

free parameters (1)
  • global similarity transformation parameters
    Learned parameters that map local geometry and poses into the canonical robot frame; their values are determined during training.
axioms (1)
  • domain assumption: RGB images combined with robot states contain sufficient information to reconstruct accurate metric-scale 3D geometry
    Core premise enabling the entire feed-forward pipeline from these inputs alone.

pith-pipeline@v0.9.0 · 5539 in / 1472 out tokens · 149261 ms · 2026-05-16T02:35:53.703522+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    π0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. In Robotics: Science and Systems (RSS), 2024

  2. [2]

    Depth pro: Sharp monocular metric depth in less than a second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In International Conference on Learning Representations (ICLR), 2025

  3. [3]

    Neural mp: A generalist neural motion planner

    Murtaza Dalal, Jiahui Yang, Russell Mendonca, Youssef Khaky, Ruslan Salakhutdinov, and Deepak Pathak. Neural mp: A generalist neural motion planner. In International Conference on Intelligent Robots and Systems (IROS), 2025

  4. [4]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer ...

  5. [5]

    Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  6. [6]

    Digital twin catalog: A large-scale photorealistic 3d object digital twin dataset

    Zhao Dong, Ka Chen, Zhaoyang Lv, Hong-Xing Yu, Yunzhi Zhang, Cheng Zhang, Yufeng Zhu, Stephen Tian, Zhengqin Li, Geordie Moffatt, et al. Digital twin catalog: A large-scale photorealistic 3d object digital twin dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  7. [7]

    Anygrasp: Robust and efficient grasp perception in spatial and temporal domains

    Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 39(5):3929–3945, 2023

  8. [8]

    St4rtrack: Simultaneous 4d reconstruction and tracking in the world

    Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  9. [9]

    Act3d: Infinite resolution action detection transformer for robotic manipulation

    Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: Infinite resolution action detection transformer for robotic manipulation. In Conference on Robot Learning (CoRL), 2023

  10. [10]

    Rvt: Robotic view transformer for 3d object manipulation

    Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning (CoRL), 2023

  11. [11]

    Rvt-2: Learning precise manipulation from few demonstrations

    Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manipulation from few demonstrations. In Robotics: Science and Systems (RSS), 2024

  12. [12]

    3d diffuser actor: Policy diffusion with 3d scene representations

    Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. In Conference on Robot Learning (CoRL), 2024

  13. [13]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414, 2025

  14. [14]

    Pyroki: A modular toolkit for robot kinematic optimization

    Chung Min Kim, Brent Yi, Hongsuk Choi, Yi Ma, Ken Goldberg, and Angjoo Kanazawa. Pyroki: A modular toolkit for robot kinematic optimization. In International Conference on Intelligent Robots and Systems (IROS), 2025

  15. [15]

    EPnP: An accurate O(n) solution to the PnP problem

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155–166, 2009

  16. [16]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  17. [17]

    Gendexgrasp: Generalizable dexterous grasping

    Puhao Li, Tengyu Liu, Yuyang Li, Yiran Geng, Yixin Zhu, Yaodong Yang, and Siyuan Huang. Gendexgrasp: Generalizable dexterous grasping. In International Conference on Robotics and Automation (ICRA), 2023

  18. [18]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  19. [19]

    Evo-0: Vision-language-action model with implicit spatial understanding

    Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision-language-action model with implicit spatial understanding. arXiv preprint arXiv:2507.00416, 2025

  20. [20]

    Manipulation as in simulation: Enabling accurate geometry perception in robots

    Minghuan Liu, Zhengbang Zhu, Xiaoshen Han, Peng Hu, Haotong Lin, Xinyao Li, Jingxiao Chen, Jiafeng Xu, Yichu Yang, Yunfeng Lin, et al. Manipulation as in simulation: Enabling accurate geometry perception in robots. arXiv preprint arXiv:2509.02530, 2025

  21. [21]

    Trace anything: Representing any video in 4d via trajectory fields

    Xinhang Liu, Yuxi Xiao, Donny Y Chen, Jiashi Feng, Yu-Wing Tai, Chi-Keung Tang, and Bingyi Kang. Trace anything: Representing any video in 4d via trajectory fields. arXiv preprint arXiv:2510.13802, 2025

  22. [22]

    H³DP: Triply-hierarchical diffusion policy for visuomotor learning

    Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H³DP: Triply-hierarchical diffusion policy for visuomotor learning. arXiv preprint arXiv:2505.07819, 2025

  23. [23]

    Dextrah-g: Pixels-to-action dexterous arm-hand grasping with geometric fabrics

    Tyler Ga Wei Lum, Martin Matak, Viktor Makoviychuk, Ankur Handa, Arthur Allshire, Tucker Hermans, Nathan D Ratliff, and Karl Van Wyk. Dextrah-g: Pixels-to-action dexterous arm-hand grasping with geometric fabrics. In Conference on Robot Learning (CoRL), 2024

  24. [24]

    One-policy-fits-all: Geometry-aware action latents for cross-embodiment manipulation

    Juncheng Mu, Sizhe Yang, Hojin Bae, Feiyu Jia, Qingwei Ben, Boyi Li, Huazhe Xu, and Jiangmiao Pang. One-policy-fits-all: Geometry-aware action latents for cross-embodiment manipulation. arXiv preprint arXiv:2603.14522, 2026

  25. [25]

    Gp3: A 3d geometry-aware policy with multi-view images for robotic manipulation

    Quanhao Qian, Guoyang Zhao, Gongjie Zhang, Jiuniu Wang, Ran Xu, Junlong Gao, and Deli Zhao. Gp3: A 3d geometry-aware policy with multi-view images for robotic manipulation. arXiv preprint arXiv:2509.15733, 2025

  26. [26]

    Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation

    Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, and Xiaolong Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation. In Conference on Robot Learning (CoRL), 2023

  27. [27]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning (CoRL), 2023

  28. [28]

    Dynamic point maps: A versatile representation for dynamic 3d reconstruction

    Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  29. [29]

    Curobo: Parallelized collision-free robot motion generation

    Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, et al. Curobo: Parallelized collision-free robot motion generation. In International Conference on Robotics and Automation (ICRA), 2023

  30. [30]

    Masked depth modeling for spatial perception

    Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, et al. Masked depth modeling for spatial perception. arXiv preprint arXiv:2601.17895, 2026

  31. [31]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  32. [32]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  33. [33]

    Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation

    Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. In International Conference on Robotics and Automation (ICRA), 2023

  34. [34]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  35. [35]

    Moge-2: Accurate monocular geometry with metric scale and sharp details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  36. [36]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  37. [37]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  38. [38]

    D(R,O) grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping

    Zhenyu Wei, Zhixuan Xu, Jingxiang Guo, Yiwen Hou, Chongkai Gao, Zhehao Cai, Jiayu Luo, and Lin Shao. D(R,O) grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. In International Conference on Robotics and Automation (ICRA), 2025

  39. [39]

    Foundationstereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  40. [40]

    Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation

    Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In Conference on Robot Learning (CoRL), 2023

  41. [41]

    Pixel-perfect depth with semantics-prompted diffusion transformers

    Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  42. [42]

    Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy

    Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  43. [43]

    Dnact: Diffusion guided multi-task 3d policy learning

    Ge Yan, Yueh-Hua Wu, and Xiaolong Wang. Dnact: Diffusion guided multi-task 3d policy learning. In International Conference on Intelligent Robots and Systems (IROS), 2025

  44. [44]

    Maniflow: A general robot manipulation policy via consistency flow training

    Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, et al. Maniflow: A general robot manipulation policy via consistency flow training. In Conference on Robot Learning (CoRL), 2025

  45. [45]

    Deep reactive policy: Learning reactive manipulator motion planning for dynamic environments

    Jiahui Yang, Jason Jingzhou Liu, Yulong Li, Youssef Khaky, Kenneth Shaw, and Deepak Pathak. Deep reactive policy: Learning reactive manipulator motion planning for dynamic environments. In Conference on Robot Learning (CoRL), 2025

  46. [46]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  47. [47]

    Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data

    Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data. arXiv preprint arXiv:2603.05312, 2026

  48. [48]

    Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning

    Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, and Huazhe Xu. Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning. In Conference on Robot Learning (CoRL), 2024

  49. [49]

    Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation

    Zhecheng Yuan, Tianming Wei, Langzhe Gu, Pu Hua, Tianhai Liang, Yuanpei Chen, and Huazhe Xu. Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation. arXiv preprint arXiv:2508.20085, 2025

  50. [50]

    Gnfactor: Multi-task real robot learning with generalizable neural feature fields

    Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning (CoRL), 2023

  51. [51]

    3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

    Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Robotics: Science and Systems (RSS), 2024

  52. [52]

    Efficiently reconstructing dynamic scenes one d4rt at a time

    Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K Barral, Raia Hadsell, et al. Efficiently reconstructing dynamic scenes one d4rt at a time. arXiv preprint arXiv:2512.08924, 2025

  53. [53]

    Generative visual foresight meets task-agnostic pose estimation in robotic table-top manipulation

    Chuye Zhang, Xiaoxiong Zhang, Wei Pan, Linfang Zheng, and Wei Zhang. Generative visual foresight meets task-agnostic pose estimation in robotic table-top manipulation. In Conference on Robot Learning (CoRL), 2025

  54. [54]

    Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes

    Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In Conference on Robot Learning (CoRL), 2024

  55. [55]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. In International Conference on Learning Representations (ICLR), 2025.

    APPENDIX

    In this Appendix, we provide details on synthetic data generation (Appendix A...

  56. [56]

    Camera Configuration: The simulation is equipped with a multi-camera system yielding RGB images. We model the sensors as pinhole cameras with randomized intrinsic parameters. For each episode, we perturb the focal lengths (f_x, f_y) and principal points (c_x, c_y) of the camera intrinsic matrix K. Additionally, we randomize the focus distance and f-number t...
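    The per-episode intrinsic randomization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data-generation code: the function names `sample_intrinsics` and `project`, the nominal intrinsics, and the jitter ranges are all hypothetical, since the source does not state the perturbation magnitudes.

```python
import random


def sample_intrinsics(fx, fy, cx, cy, focal_jitter=0.05, center_jitter=0.02, rng=None):
    """Return a 3x3 pinhole intrinsic matrix K with perturbed parameters.

    focal_jitter and center_jitter are hypothetical relative ranges,
    not values taken from the paper.
    """
    rng = rng or random.Random()
    # Scale each nominal parameter by an independent factor in [1 - j, 1 + j].
    fx_p = fx * (1.0 + rng.uniform(-focal_jitter, focal_jitter))
    fy_p = fy * (1.0 + rng.uniform(-focal_jitter, focal_jitter))
    cx_p = cx * (1.0 + rng.uniform(-center_jitter, center_jitter))
    cy_p = cy * (1.0 + rng.uniform(-center_jitter, center_jitter))
    return [
        [fx_p, 0.0, cx_p],
        [0.0, fy_p, cy_p],
        [0.0, 0.0, 1.0],
    ]


def project(K, point):
    """Project a 3D point in the camera frame (z > 0) to pixel coordinates."""
    x, y, z = point
    u = K[0][0] * x / z + K[0][2]
    v = K[1][1] * y / z + K[1][2]
    return u, v


# One draw per episode; the seed is fixed here only for reproducibility.
K = sample_intrinsics(600.0, 600.0, 320.0, 240.0, rng=random.Random(0))
u, v = project(K, (0.1, -0.05, 1.0))
```

    Drawing a fresh K once per episode, rather than per frame, keeps the intrinsics consistent within an episode while still varying them across the dataset.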

  57. [57]

    Lighting Configuration: The lighting system consists of three types of light sources: Dome, Sphere, and Distant lights. This setup allows us to effectively simulate a wide range of lighting conditions, as described below:

    • Dome Light (Environment Map): We utilize High Dynamic Range Images (HDRI) loaded from a predefined asset list. For each episode, a rando...

  58. [58]

    Scene Composition and Material Randomization: The scene consists of a robot manipulator, manipulable objects, and background elements such as a tabletop.

    • Robot: The robot’s joint configuration is initialized via an inverse kinematics (IK) solver to reach random valid end-effector poses. We apply material randomization to the robot’s visual mesh, adding ji...

  59. [59]

    Imitation Learning Details: We select four tasks, Sweep Bean, Insert Screw, Breakfast, and BiDex Pour, to validate the application of Robo3R in imitation learning. Sweep Bean, Insert Screw, and Breakfast are conducted on a single-arm platform, while BiDex Pour is performed on a bimanual robot platform. For the single-arm platform, we use two RealSense D455...

  60. [60]

    Sim-to-Real Transfer Details: We assess whether Robo3R is effective in narrowing the sim-to-real visual gap in two tasks: Push Cube and Pick Cube, as depicted in Fig. 10. The sim-to-real visual gaps produced by different methods are visualized in Fig. 11. Compared to RGB images and point clouds acquired by depth cameras, Robo3R achieves a substantially sma...