arxiv: 2602.10101 · v2 · submitted 2026-02-10 · 💻 cs.RO

Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction

Sizhe Yang , Linning Xu , Hao Li , Juncheng Mu , Jia Zeng , Dahua Lin , Jiangmiao Pang

Pith reviewed 2026-05-16 02:35 UTC · model grok-4.3

classification 💻 cs.RO

keywords 3D reconstruction · robotic manipulation · feed-forward model · point cloud · metric scale · grasp synthesis · motion planning · sim-to-real

The pith

Robo3R predicts accurate metric-scale 3D geometry directly from RGB images and robot states for real-time robotic use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Robo3R as a feed-forward model that reconstructs scene geometry at metric scale without relying on depth sensors. It processes single RGB images plus robot states to output local geometry and poses, then aligns them into the robot's canonical frame through a learned similarity transform. A masked point head produces sharp point clouds while a keypoint PnP step refines camera alignment. The model is trained on a four-million-frame synthetic dataset and tested on downstream manipulation tasks. If the approach holds, robots could achieve more reliable 3D perception using ordinary cameras across imitation learning, grasping, and planning.
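The alignment step described above can be illustrated with a minimal sketch, assuming the similarity transform is parameterized as a scale, a rotation matrix, and a translation. The names `apply_similarity` and `rot_z` and all numeric values are hypothetical; in the paper these parameters are predicted by the network, not set by hand.

```python
import math

def apply_similarity(points, scale, rotation, translation):
    """Map each point x to scale * R @ x + t (R is a 3x3 nested list)."""
    out = []
    for x in points:
        rotated = [sum(rotation[i][j] * x[j] for j in range(3)) for i in range(3)]
        out.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return out

def rot_z(angle_rad):
    """Rotation about the z axis, used here only to build an example R."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical scale-invariant local points and transform parameters.
local_points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
robot_points = apply_similarity(
    local_points,
    scale=0.5,                    # metric scale factor (assumed value)
    rotation=rot_z(math.pi / 2),  # 90 degree rotation about z (assumed value)
    translation=(0.1, 0.0, 0.3),  # offset in the robot frame (assumed value)
)
print(robot_points)
```

The scale factor is the only term that changes distances between points, so it is the part of the transform that carries metric information.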

Core claim

Robo3R jointly infers scale-invariant local geometry and relative camera poses from RGB images and robot states, unifies them into the scene representation in the canonical robot frame via a learned global similarity transformation, employs a masked point head for fine-grained point clouds and a keypoint-based PnP formulation to refine extrinsics, and when trained on the Robo3R-4M synthetic dataset outperforms existing reconstruction methods and depth sensors on accuracy and on tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.

What carries the argument

The masked point head for sharp point clouds combined with keypoint-based PnP refinement, unified by a learned global similarity transformation into the robot frame.
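A rough sketch of how a masked point head could turn its outputs into points, based on the Figure 3 caption (depth, normalized image coordinates, and a mask, combined by unprojection and masking). The function name, the threshold, and the toy inputs are assumptions; the paper's exact combination rule is not given in the visible text.

```python
def unproject_masked(depth, norm_xy, mask, threshold=0.5):
    """Unproject per-pixel predictions into 3D points, keeping masked-in pixels.

    depth:   H x W nested list of z values
    norm_xy: H x W nested list of (x/z, y/z) normalized image coordinates
    mask:    H x W nested list of validity scores in [0, 1]
    """
    points = []
    for row_d, row_xy, row_m in zip(depth, norm_xy, mask):
        for z, (u, v), m in zip(row_d, row_xy, row_m):
            # Keep a pixel only if the mask accepts it and depth is positive.
            if m >= threshold and z > 0.0:
                points.append((u * z, v * z, z))
    return points

# Toy 1 x 2 "image": the second pixel is rejected by the mask.
depth = [[2.0, 4.0]]
norm_xy = [[(0.5, -0.25), (0.1, 0.1)]]
mask = [[0.9, 0.2]]
points = unproject_masked(depth, norm_xy, mask)
print(points)
```

Dropping low-confidence pixels instead of regressing a point for every pixel is one plausible way to avoid smeared points at depth discontinuities.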

Load-bearing premise

The high-fidelity synthetic Robo3R-4M dataset supplies training data that generalizes to real-world robotic environments and physical interactions without large domain gaps.

What would settle it

A side-by-side real-robot grasping experiment in which depth-sensor input yields higher success rates than Robo3R reconstructions on the same hardware and scenes.

Figures

Figures reproduced from arXiv: 2602.10101 by Dahua Lin, Hao Li, Jiangmiao Pang, Jia Zeng, Juncheng Mu, Linning Xu, Sizhe Yang.

**Figure 1.** Overview. Robo3R enables manipulation-ready 3D reconstruction from RGB frames in real time. By achieving accurate metric-scale 3D geometry in the canonical robot frame, Robo3R eliminates the need for depth sensors and calibration, while improving accuracy and robustness in challenging manipulation scenarios. These features lead to notable improvements in downstream applications such as imitation learning, …

**Figure 2.** Method Overview. RGB images and robot states are encoded and fused. The transformer backbone processes the resulting features through alternating global and frame-wise attention. The masked point head decodes scale-invariant local geometry, while the relative pose head outputs relative poses for registering points across multiple views. S.T. tokens read out the global similarity transformation, which maps …

**Figure 3.** Masked point head. To address the over-smoothing problem for dense prediction, we propose a masked point head that decomposes point prediction into depth, normalized image coordinate, and mask predictions. Through unprojection, masking, and combination, we obtain sharp points with fine-grained geometric details.

**Figure 4.** Extrinsic estimation module. The extrinsic estimation module extracts robot keypoints and accurately estimates the camera extrinsics by solving the Perspective-n-Point (PnP) problem; the camera extrinsics are used to refine the global similarity transformation. … (MLP) with GeLU activations. The image and state features are then fused via element-wise addition to obtain the combined features $F \in \mathbb{R}^{N \times \frac{H}{14} \times \dots}$

**Figure 5.** Data samples. The dataset showcases a diverse array of assets with extensive randomization, encompassing rich modalities and comprehensive annotations.

**Figure 6.** Qualitative comparisons of 3D geometry. We use a RealSense D455 as the depth camera and manually align the scale for point clouds reconstructed by π³. We evaluate the methods on several challenging scenarios, including a tiny object (row 1), a scene with a mirror and a transparent cup (row 2), and a cluttered environment (row 3). … only 2 mm, requiring millimeter-level reconstruction and manipulation preci…

**Figure 7.** Imitation learning tasks. We design four manipulation tasks for imitation learning evaluation: Sweep Bean, Insert Screw, Breakfast, and BiDex Pour. 2) Sim-to-Real Transfer: Experimental setup and baselines. We collect 200 demonstrations for each of the “Push Cube” and “Pick Cube” tasks in NVIDIA Isaac Sim. The baseline methods acquire data using either RGB or depth cameras in simulation and deploy policie…

**Figure 8.** Reconstruction w/ and w/o the masked point head.

**Figure 9.** Hardware setup. Our experimental platform features both single-arm and dual-arm robots. We use RealSense D455 cameras to capture images of the scene. … with binocular input, meeting the real-time requirements of robotic manipulation. D. Real-World Manipulation Experiment Details: The robotic platforms used in our experiments consist of a single-arm Franka Research 3 equipped with a parallel gripper and a bima…

**Figure 11.** Sim-to-real visual gap. The left column shows data obtained from simulation, while the right column presents data from the real world. The first row displays observations from the RGB camera, the second row shows point clouds from the depth camera, and the third row illustrates point clouds reconstructed by Robo3R from RGB images.

**Figure 12.** Reconstruction results in diverse scenarios. … strong generalization capabilities to other robot types, such as a wheeled dual-arm robot (Row 1), as well as to indoor scenes without any robots present (Row 2).

Original abstract

3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
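The keypoint-based PnP step mentioned in the abstract can be made concrete with a sketch of the quantity a PnP solver typically minimizes: the pixel reprojection error of known 3D keypoints under a candidate camera pose. This is not a PnP solver and not the paper's implementation; the pinhole intrinsics, keypoint positions, and function names are assumptions for illustration.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)

def reprojection_rmse(points_robot, pixels, rotation, translation, fx, fy, cx, cy):
    """RMS pixel error of robot-frame keypoints under a candidate extrinsic.

    rotation (3x3 nested list) and translation map robot-frame points
    into the camera frame: p_cam = R @ p_robot + t.
    """
    total = 0.0
    for p, (u_obs, v_obs) in zip(points_robot, pixels):
        p_cam = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                 for i in range(3)]
        u, v = project(p_cam, fx, fy, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(total / len(pixels))

# Hypothetical robot keypoints (e.g. from forward kinematics) and detections.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
keypoints_3d = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
detected_2d = [(320.0, 240.0), (383.0, 244.0)]
rmse = reprojection_rmse(keypoints_3d, detected_2d, identity, (0.0, 0.0, 1.0),
                         fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(rmse)
```

A solver would search over the rotation and translation to drive this error down; because the 3D keypoints come from the robot's own kinematics, the recovered pose is expressed relative to the robot frame.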

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Robo3R adds robot-state inputs and a learned similarity transform to feed-forward reconstruction so outputs land in the robot frame, but the real-world metric gains over depth sensors rest on unshown numbers from synthetic training.

Robo3R feeds RGB images plus robot joint states into a feed-forward network that predicts local geometry and relative poses, then unifies them with a learned global similarity transform and keypoint PnP refinement to produce metric point clouds in the robot coordinate frame. A masked point head is added for sharper details. This setup is trained on their Robo3R-4M synthetic dataset of four million frames. The core idea is to bypass noisy depth sensors and give manipulation pipelines geometry that is already scaled and aligned correctly for grasping and planning. That integration of robot kinematics is the clearest new piece; most prior reconstruction work treats the camera as standalone.

The framing of the problem is also direct: existing models lack the precision and speed needed for physical interaction. If the full paper shows clean ablations on the PnP step and the similarity transform, those choices look practical.

The soft spot is generalization. The abstract claims consistent wins on imitation learning, sim-to-real transfer, grasp synthesis, and collision-free planning, yet the provided text gives no error tables, no real-robot Chamfer distances against depth baselines, and no domain-randomization details. Without those, the advantage could shrink once material reflectance and lighting appear. The stress-test note on synthetic-to-real transfer is on target here.

This paper is for people building vision-based manipulation stacks who need faster, more consistent 3D input than current sensors provide. A serious referee should see it because the architecture targets a real bottleneck and the dataset is new, even if the experiments will probably need tightening. I would bring it to a reading group to walk through the alignment module, but I would not cite it until the quantitative results are checked.

Referee Report

2 major / 1 minor

Summary. The paper introduces Robo3R, a feed-forward 3D reconstruction model for robotic manipulation that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. It jointly infers scale-invariant local geometry and relative camera poses, unifies them into a canonical robot frame via a learned global similarity transformation, employs a masked point head for fine-grained point clouds and a keypoint-based PnP formulation for extrinsic refinement, and is trained on the Robo3R-4M synthetic dataset of four million high-fidelity frames. The central claim is that Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors, yielding performance gains across downstream robotic tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning.

Significance. If the empirical claims are substantiated, Robo3R would represent a meaningful advance in robotic 3D perception by supplying a real-time, feed-forward alternative to noisy depth sensors that maintains metric consistency and integrates robot state for canonical alignment. This could improve precision in physical manipulation tasks and reduce reliance on hardware-specific sensing, particularly if the synthetic-to-real transfer holds under realistic conditions.

major comments (2)

[Abstract] Abstract: the assertion that Robo3R 'consistently outperforms state-of-the-art reconstruction methods and depth sensors' and delivers 'consistent gains in performance' across downstream tasks is presented without any quantitative metrics, error bars, ablation studies, or evaluation-protocol details, leaving the central empirical claims unsupported in the visible text.
[Evaluation] The central generalization claim (synthetic training on Robo3R-4M yielding metric-accurate geometry for physical robot tasks) lacks supporting evidence such as domain-randomization ablations, real-sensor noise modeling, or quantitative real-world tables reporting Chamfer distances and pose errors against depth-sensor baselines; without these, the precision advantage may not survive material reflectance, calibration drift, or lighting variations.
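For readers unfamiliar with the metric the referee requests, here is a minimal sketch of a symmetric Chamfer distance between a reconstructed and a reference point cloud. Conventions differ (squared versus unsquared distances, sum versus mean), and the visible text does not say which one the paper uses, so this unsquared, mean-based form and the toy point clouds are assumptions.

```python
import math

def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    def one_way(src, dst):
        # For each source point, distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)

# Toy clouds in meters: the second predicted point is 0.1 m off in z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
cd = chamfer_distance(predicted, reference)
print(cd)
```

This brute-force version is quadratic in the number of points; evaluations on dense clouds would normally use a spatial index such as a k-d tree.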

minor comments (1)

[Method] Clarify the exact formulation of the learned global similarity transformation and its interaction with the PnP refinement step to avoid ambiguity in how metric scale is recovered.
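As background for this comment, a similarity transformation is commonly written as below. This is the standard textbook form, not a formulation confirmed by the paper; the symbols $s$, $\mathbf{R}$, $\mathbf{t}$ and the frame subscripts are assumed notation.

```latex
\[
\mathbf{x}_{\text{robot}} = s\,\mathbf{R}\,\mathbf{x}_{\text{local}} + \mathbf{t},
\qquad s > 0,\; \mathbf{R} \in SO(3),\; \mathbf{t} \in \mathbb{R}^{3}
\]
\[
\lVert \mathbf{x}_{\text{robot}} - \mathbf{y}_{\text{robot}} \rVert
= s\,\lVert \mathbf{x}_{\text{local}} - \mathbf{y}_{\text{local}} \rVert
\]
```

The second identity shows that only $s$ changes distances, so metric scale is determined entirely by that one factor. One question the requested clarification could answer is whether the PnP refinement updates $s$ as well, or only $\mathbf{R}$ and $\mathbf{t}$.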

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your review. We appreciate the opportunity to clarify and strengthen the empirical support in our paper. We respond to each major comment below.

Referee: [Abstract] Abstract: the assertion that Robo3R 'consistently outperforms state-of-the-art reconstruction methods and depth sensors' and delivers 'consistent gains in performance' across downstream tasks is presented without any quantitative metrics, error bars, ablation studies, or evaluation-protocol details, leaving the central empirical claims unsupported in the visible text.

Authors: We agree that the abstract should provide more concrete support for the claims. The detailed quantitative metrics, error bars, ablation studies, and evaluation protocols are presented in Section 4 (Evaluation) and the appendix. In the revised manuscript, we have updated the abstract to include specific quantitative results, such as the reported improvements in 3D reconstruction accuracy and downstream task performance, to make the claims more substantiated in the visible text.

revision: yes

Referee: [Evaluation] The central generalization claim (synthetic training on Robo3R-4M yielding metric-accurate geometry for physical robot tasks) lacks supporting evidence such as domain-randomization ablations, real-sensor noise modeling, or quantitative real-world tables reporting Chamfer distances and pose errors against depth-sensor baselines; without these, the precision advantage may not survive material reflectance, calibration drift, or lighting variations.

Authors: The manuscript does include quantitative evaluations on physical robots for sim-to-real transfer, grasp synthesis, and motion planning, showing gains over depth sensors. However, we acknowledge the value of additional ablations. We have added domain-randomization ablations and real-sensor noise modeling in the revised version. Additionally, we include a new table with real-world Chamfer distances and pose errors compared to depth-sensor baselines to demonstrate robustness to material reflectance, calibration drift, and lighting variations.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical feed-forward reconstruction model trained on the external Robo3R-4M synthetic dataset and evaluated on separate downstream robotic tasks (imitation learning, grasp synthesis, motion planning). No load-bearing step reduces by construction to self-definition, fitted inputs renamed as predictions, or self-citation chains; the architecture (masked point head, learned similarity transform, PnP refinement) is motivated by task requirements rather than tautologically derived from the reported results. Performance gains are shown via comparisons to baselines on held-out evaluations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that RGB images plus robot states suffice for metric 3D geometry and that synthetic data generalizes to real manipulation; the learned global similarity transformation is a fitted component without independent verification.

free parameters (1)

global similarity transformation parameters
Learned parameters that map local geometry and poses into the canonical robot frame; their values are determined during training.

axioms (1)

domain assumption: RGB images combined with robot states contain sufficient information to reconstruct accurate metric-scale 3D geometry
Core premise enabling the entire feed-forward pipeline from these inputs alone.

pith-pipeline@v0.9.0 · 5539 in / 1472 out tokens · 149261 ms · 2026-05-16T02:35:53.703522+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear

Paper passage: Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation... keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · J_uniquely_calibrated_via_higher_derivative · relation: unclear

Paper passage: Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

In Robotics: Science and Systems (RSS), 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. In Robotics: Science and Systems (RSS), 2024

work page 2024
[2]

Depth pro: Sharp monocular metric depth in less than a second

Aleksei Bochkovskii, Ama ˜AG ¸ l Delaunoy, Hugo Ger- main, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025
[3]

Neu- ral mp: A generalist neural motion planner

Murtaza Dalal, Jiahui Yang, Russell Mendonca, Youssef Khaky, Ruslan Salakhutdinov, and Deepak Pathak. Neu- ral mp: A generalist neural motion planner. InInter- national Conference on Intelligent Robots and Systems (IROS), 2025

work page 2025
[4]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Input Imagesw/o MPHOurs 1.5 mm-diameter ropes Fig. 8:Reconstruction w/ and w/o the masked point head. Proceedings of the IEEE/CVF Conference on Computer ...

work page 2023
[5]

Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[6]

Digital twin cata- log: A large-scale photorealistic 3d object digital twin dataset

Zhao Dong, Ka Chen, Zhaoyang Lv, Hong-Xing Yu, Yunzhi Zhang, Cheng Zhang, Yufeng Zhu, Stephen Tian, Zhengqin Li, Geordie Moffatt, et al. Digital twin cata- log: A large-scale photorealistic 3d object digital twin dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[7]

Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023

Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains.IEEE Transactions on Robotics, 39(5):3929–3945, 2023

work page 2023
[8]

St4rtrack: Simultaneous 4d recon- struction and tracking in the world

Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d recon- struction and tracking in the world. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025
[9]

Act3d: Infinite resolution action detection transformer for robotic manipulation

Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: Infinite resolution action detection transformer for robotic manipulation. InCon- ference on Robot Learning (CoRL), 2023

work page 2023
[10]

Rvt: Robotic view transformer for 3d object manipulation

Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. InConference on Robot Learning (CoRL), 2023

work page 2023
[11]

Rvt-2: Learning precise manip- ulation from few demonstrations

Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. Rvt-2: Learning precise manip- ulation from few demonstrations. InRobotics: Science and Systems (RSS), 2024

work page 2024
[12]

3d diffuser actor: Policy diffusion with 3d scene representations

Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. InConference on Robot Learning (CoRL), 2024

work page 2024
[13]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Pyroki: A modular toolkit for robot kinematic optimization

Chung Min Kim, Brent Yi, Hongsuk Choi, Yi Ma, Ken Goldberg, and Angjoo Kanazawa. Pyroki: A modular toolkit for robot kinematic optimization. in 2025 ieee. InInternational Conference on Intelligent Robots and Systems (IROS), 2025

work page 2025
[15]

Ep n p: An accurate o (n) solution to the p n p problem.International journal of computer vision, 81 (2):155–166, 2009

Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Ep n p: An accurate o (n) solution to the p n p problem.International journal of computer vision, 81 (2):155–166, 2009

work page 2009
[16]

Spatial forcing: Implicit spatial representation align- ment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation align- ment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025

work page arXiv 2025
[17]

Gendexgrasp: Generalizable dexterous grasping

Puhao Li, Tengyu Liu, Yuyang Li, Yiran Geng, Yixin Zhu, Yaodong Yang, and Siyuan Huang. Gendexgrasp: Generalizable dexterous grasping. InInternational Con- ference on Robotics and Automation (ICRA), 2023

work page 2023
[18]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Evo-0: Vision- language-action model with implicit spatial understand- ing.arXiv preprint arXiv:2507.00416, 2025

Tao Lin, Gen Li, Yilei Zhong, Yanwen Zou, Yuxin Du, Jiting Liu, Encheng Gu, and Bo Zhao. Evo-0: Vision- language-action model with implicit spatial understand- ing.arXiv preprint arXiv:2507.00416, 2025

work page arXiv 2025
[20]

Manipulation as in simulation: Enabling accurate geometry perception in robots.arXiv preprint arXiv:2509.02530, 2025

Minghuan Liu, Zhengbang Zhu, Xiaoshen Han, Peng Hu, Haotong Lin, Xinyao Li, Jingxiao Chen, Jiafeng Xu, Yichu Yang, Yunfeng Lin, et al. Manipulation as in simulation: Enabling accurate geometry perception in robots.arXiv preprint arXiv:2509.02530, 2025

work page arXiv 2025
[21]

Trace anything: Representing any video in 4d via trajectory fields.arXiv preprint arXiv:2510.13802, 2025

Xinhang Liu, Yuxi Xiao, Donny Y Chen, Jiashi Feng, Yu-Wing Tai, Chi-Keung Tang, and Bingyi Kang. Trace anything: Representing any video in 4d via trajectory fields.arXiv preprint arXiv:2510.13802, 2025

work page arXiv 2025
[22]

H 3dp: Triply-hierarchical diffusion policy for visuomotor learn- ing.arXiv preprint arXiv:2505.07819, 2025

Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, and Huazhe Xu. H 3dp: Triply-hierarchical diffusion policy for visuomotor learn- ing.arXiv preprint arXiv:2505.07819, 2025

work page arXiv 2025
[23]

Dextrah-g: Pixels- to-action dexterous arm-hand grasping with geometric fabrics

Tyler Ga Wei Lum, Martin Matak, Viktor Makoviy- chuk, Ankur Handa, Arthur Allshire, Tucker Hermans, Nathan D Ratliff, and Karl Van Wyk. Dextrah-g: Pixels- to-action dexterous arm-hand grasping with geometric fabrics. InConference on Robot Learning (CoRL), 2024

work page 2024
[24]

One-policy-fits-all: Geometry-aware action la- tents for cross-embodiment manipulation.arXiv preprint arXiv:2603.14522, 2026

Juncheng Mu, Sizhe Yang, Hojin Bae, Feiyu Jia, Qingwei Ben, Boyi Li, Huazhe Xu, and Jiangmiao Pang. One-policy-fits-all: Geometry-aware action la- tents for cross-embodiment manipulation.arXiv preprint arXiv:2603.14522, 2026

work page arXiv 2026
[25]

Gp3: A 3d geometry-aware policy with multi-view images for robotic manipulation.arXiv preprint arXiv:2509.15733, 2025

Quanhao Qian, Guoyang Zhao, Gongjie Zhang, Jiuniu Wang, Ran Xu, Junlong Gao, and Deli Zhao. Gp3: A 3d geometry-aware policy with multi-view images for robotic manipulation.arXiv preprint arXiv:2509.15733, 2025

work page arXiv 2025
[26]

Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation

Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Hao Su, and Xiaolong Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation. InConference on Robot Learning (CoRL), 2023

work page 2023
[27]

Perceiver-actor: A multi-task transformer for robotic ma- nipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic ma- nipulation. InConference on Robot Learning (CoRL), 2023

work page 2023
[28]

Dynamic point maps: A versatile representation for dynamic 3d reconstruction

Edgar Sucar, Zihang Lai, Eldar Insafutdinov, and Andrea Vedaldi. Dynamic point maps: A versatile representation for dynamic 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

work page 2025
[29]

Curobo: Parallelized collision-free robot motion generation

Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, et al. Curobo: Parallelized collision-free robot motion generation. InInternational Conference on Robotics and Automation (ICRA), 2023

work page 2023
[30]

Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, et al. Masked depth modeling for spatial perception.arXiv preprint arXiv:2601.17895, 2026

work page arXiv 2026
[31]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[32]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[33]

Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation

Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. InInternational Conference on Robotics and Automation (ICRA), 2023

work page 2023
[34]

Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xi- ang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025

work page 2025
[35]

Moge-2: Accurate monocular geometry with metric scale and sharp details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jian- feng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[36]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

[37]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

[38]

Zhenyu Wei, Zhixuan Xu, Jingxiang Guo, Yiwen Hou, Chongkai Gao, Zhehao Cai, Jiayu Luo, and Lin Shao. D(R,O) grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. In International Conference on Robotics and Automation (ICRA), 2025

[39]

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

[40]

Zhou Xian, Nikolaos Gkanatsios, Theophile Gervet, Tsung-Wei Ke, and Katerina Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In Conference on Robot Learning (CoRL), 2023

[41]

Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2025

[42]

Yinzhen Xu, Weikang Wan, Jialiang Zhang, Haoran Liu, Zikang Shan, Hao Shen, Ruicheng Wang, Haoran Geng, Yijia Weng, Jiayi Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

[43]

Ge Yan, Yueh-Hua Wu, and Xiaolong Wang. Dnact: Diffusion guided multi-task 3d policy learning. In International Conference on Intelligent Robots and Systems (IROS), 2025

[44]

Ge Yan, Jiyue Zhu, Yuquan Deng, Shiqi Yang, Ri-Zhao Qiu, Xuxin Cheng, Marius Memmel, Ranjay Krishna, Ankit Goyal, Xiaolong Wang, et al. Maniflow: A general robot manipulation policy via consistency flow training. In Conference on Robot Learning (CoRL), 2025

[45]

Jiahui Yang, Jason Jingzhou Liu, Yulong Li, Youssef Khaky, Kenneth Shaw, and Deepak Pathak. Deep reactive policy: Learning reactive manipulator motion planning for dynamic environments. In Conference on Robot Learning (CoRL), 2025

[46]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. In Advances in Neural Information Processing Systems (NeurIPS), 2024

[47]

Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data. arXiv preprint arXiv:2603.05312, 2026

[48]

Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, and Huazhe Xu. Learning to manipulate anywhere: A visual generalizable framework for reinforcement learning. In Conference on Robot Learning (CoRL), 2024

[49]

Zhecheng Yuan, Tianming Wei, Langzhe Gu, Pu Hua, Tianhai Liang, Yuanpei Chen, and Huazhe Xu. Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation. arXiv preprint arXiv:2508.20085, 2025

[50]

Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning (CoRL), 2023

[51]

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Robotics: Science and Systems (RSS), 2024

[52]

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K Barral, Raia Hadsell, et al. Efficiently reconstructing dynamic scenes one d4rt at a time. arXiv preprint arXiv:2512.08924, 2025

[53]

Chuye Zhang, Xiaoxiong Zhang, Wei Pan, Linfang Zheng, and Wei Zhang. Generative visual foresight meets task-agnostic pose estimation in robotic table-top manipulation. In Conference on Robot Learning (CoRL), 2025

[54]

Jialiang Zhang, Haoran Liu, Danshi Li, XinQiang Yu, Haoran Geng, Yufei Ding, Jiayi Chen, and He Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In Conference on Robot Learning (CoRL), 2024

[55]

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. In International Conference on Learning Representations (ICLR), 2025.

APPENDIX

In this Appendix, we provide details on synthetic data generation (Appendix A...


Camera Configuration: The simulation is equipped with a multi-camera system yielding RGB images. We model the sensors as pinhole cameras with randomized intrinsic parameters. For each episode, we perturb the focal lengths $(f_x, f_y)$ and principal points $(c_x, c_y)$ of the camera intrinsic matrix $K$. Additionally, we randomize the focus distance and f-number t...
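The per-episode intrinsic perturbation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `sample_intrinsics` and `project`, the multiplicative uniform jitter, and the jitter ranges are all assumptions, since the source does not state the sampling distribution or its bounds.

```python
import random


def sample_intrinsics(base_fx, base_fy, base_cx, base_cy,
                      focal_jitter=0.05, center_jitter=0.02, rng=None):
    """Return a 3x3 pinhole intrinsic matrix K with perturbed parameters.

    The jitter ranges are hypothetical placeholders; the paper's actual
    ranges are not given in this excerpt.
    """
    rng = rng or random.Random()
    # Perturb focal lengths and principal point independently.
    fx = base_fx * (1.0 + rng.uniform(-focal_jitter, focal_jitter))
    fy = base_fy * (1.0 + rng.uniform(-focal_jitter, focal_jitter))
    cx = base_cx * (1.0 + rng.uniform(-center_jitter, center_jitter))
    cy = base_cy * (1.0 + rng.uniform(-center_jitter, center_jitter))
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]


def project(K, point):
    """Project a 3D point in the camera frame (z > 0) to pixel coordinates."""
    x, y, z = point
    u = K[0][0] * x / z + K[0][2]
    v = K[1][1] * y / z + K[1][2]
    return u, v


# One "episode": sample K once, then reuse it for every frame in the episode.
K = sample_intrinsics(600.0, 600.0, 320.0, 240.0, rng=random.Random(0))
u, v = project(K, (0.0, 0.0, 1.0))
```

A point on the optical axis projects to the perturbed principal point, which gives a quick sanity check that the sampled $K$ is being applied consistently.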


Lighting Configuration: The lighting system consists of three types of light sources: Dome, Sphere, and Distant lights. This setup allows us to effectively simulate a wide range of lighting conditions, as described below:

• Dome Light (Environment Map): We utilize High Dynamic Range Images (HDRI) loaded from a predefined asset list. For each episode, a rando...


Scene Composition and Material Randomization: The scene consists of a robot manipulator, manipulable objects, and background elements such as a tabletop.

• Robot: The robot’s joint configuration is initialized via an inverse kinematics (IK) solver to reach random valid end-effector poses. We apply material randomization to the robot’s visual mesh, adding ji...


Imitation Learning Details: We select four tasks, Sweep Bean, Insert Screw, Breakfast, and BiDex Pour, to validate the application of Robo3R in imitation learning. Sweep Bean, Insert Screw, and Breakfast are conducted on a single-arm platform, while BiDex Pour is performed on a bimanual robot platform. For the single-arm platform, we use two RealSense D455...


Sim-to-Real Transfer Details: We assess whether Robo3R is effective in narrowing the sim-to-real visual gap in two tasks: Push Cube and Pick Cube, as depicted in Fig. 10. The sim-to-real visual gaps produced by different methods are visualized in Fig. 11. Compared to RGB images and point clouds acquired by depth cameras, Robo3R achieves a substantially sma...
