pith. machine review for the scientific record.

arxiv: 2604.14089 · v1 · submitted 2026-04-15 · 💻 cs.RO · cs.AI

Recognition: unknown

UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:52 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: universal manipulation interface · LiDAR SLAM · visuomotor policy · multimodal data collection · 3D spatial perception · embodied manipulation · deformable object manipulation · robot learning

The pith

Adding a lightweight LiDAR sensor to the wrist-mounted UMI produces reliable metric-scale 3D pose data that raises policy success rates and unlocks tasks with deformable and articulated objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the original Universal Manipulation Interface by mounting a low-cost LiDAR alongside the camera on the wrist device. This change replaces fragile monocular visual SLAM with LiDAR-centric SLAM that stays accurate even when scenes contain occlusions or fast motion. The collected demonstrations gain consistent 3D structure and metric scale while the downstream policy remains a standard 2D visuomotor network. Because the input data are now higher quality, the same training pipeline yields policies that succeed more often on everyday tasks and can also master new problems such as handling large soft objects or opening articulated mechanisms.

Core claim

UMI-3D integrates a hardware-synchronized LiDAR into the portable wrist interface and supplies a unified spatiotemporal calibration that aligns images with point clouds, delivering accurate 3D pose trajectories; these trajectories raise demonstration quality enough to improve policy performance on standard tasks and to enable previously infeasible ones, all without altering the 2D policy formulation.
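The pose chain behind these "3D pose trajectories" is simple to state: the LiDAR-inertial odometry provides the LiDAR pose in the global frame, and a fixed calibrated extrinsic carries it to the camera (and end-effector) frame, as Figure 7 later illustrates. A minimal sketch under that reading, with illustrative names (T_GL, T_LC) rather than the paper's code:

    # Minimal sketch, not the authors' implementation: recover the camera
    # trajectory from LiDAR-inertial odometry poses via a calibrated extrinsic.
    # T_GL = LiDAR pose in the global frame G; T_LC = camera frame expressed
    # in the LiDAR frame (names are ours, chosen to mirror Figure 7).
    import numpy as np

    def camera_trajectory_from_lidar(T_GL_list, T_LC):
        """Return metric-scale camera poses ^G T_C = ^G T_L @ ^L T_C."""
        return [T_GL @ T_LC for T_GL in T_GL_list]

    # Toy usage: a 5 cm camera offset along the LiDAR x-axis.
    T_LC = np.eye(4)
    T_LC[0, 3] = 0.05
    camera_poses = camera_trajectory_from_lidar([np.eye(4)], T_LC)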

What carries the argument

Lightweight LiDAR integration with a unified spatiotemporal calibration framework that fuses visual observations and LiDAR point clouds into consistent metric-scale 3D demonstration representations.
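The spatial half of that calibration can be pictured as projecting LiDAR points into the fisheye image once the extrinsic is known. A minimal sketch assuming an ideal equidistant fisheye model (r = f·θ, the projection model named in Figure 4) and ignoring the distortion terms a full calibration would estimate; T_CL, f, cx, cy are illustrative parameter names:

    # Minimal sketch, under the assumptions stated in the text above.
    import numpy as np

    def project_lidar_to_fisheye(points_lidar, T_CL, f, cx, cy):
        """points_lidar: (N, 3) points in the LiDAR frame.
        T_CL: 4x4 extrinsic mapping LiDAR coordinates into the camera frame.
        f, cx, cy: equidistant focal length and principal point, in pixels."""
        ones = np.ones((points_lidar.shape[0], 1))
        p_cam = (T_CL @ np.hstack([points_lidar, ones]).T).T[:, :3]
        x, y, z = p_cam[:, 0], p_cam[:, 1], p_cam[:, 2]
        theta = np.arctan2(np.hypot(x, y), z)   # angle from the optical axis
        phi = np.arctan2(y, x)                  # azimuth around the axis
        r = f * theta                           # equidistant: radius ~ angle
        return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=1)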

If this is right

  • Standard manipulation tasks reach high success rates because the collected demonstrations contain fewer tracking failures and more consistent geometry.
  • Large deformable-object manipulation and articulated-object operation become learnable even though the policy itself stays 2D.
  • The full pipeline from synchronized capture through calibration, training, and deployment remains portable and open-source.
  • The same hardware-software stack supports large-scale data collection without sacrificing accessibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 3D data could later be fed directly into 3D-aware policies to test whether further gains are possible beyond the current 2D formulation.
  • Similar LiDAR upgrades might be applied to other portable demonstration interfaces to improve robustness in unstructured environments.
  • Open-sourced calibration tools could become a shared resource for aligning multimodal sensors in other robotics data pipelines.

Load-bearing premise

The added LiDAR and calibration produce accurate metric-scale poses in real scenes without creating new systematic errors or drift that cancel the claimed data-quality gains.
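One direct way to probe this premise, which the referee report below also raises, is to compare estimated trajectories against an external reference such as motion capture. A minimal sketch of the standard absolute-trajectory-error (ATE) computation, with a rigid alignment before the residual; inputs are hypothetical time-associated positions, not data from the paper:

    # Minimal sketch: ATE RMSE after rigid (rotation + translation) alignment.
    import numpy as np

    def ate_rmse(est_xyz, gt_xyz):
        """est_xyz, gt_xyz: (N, 3) time-associated positions in meters."""
        mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
        # Kabsch/Procrustes: rotation best mapping centered estimates onto GT.
        U, _, Vt = np.linalg.svd((gt_xyz - mu_g).T @ (est_xyz - mu_e))
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
        R = U @ D @ Vt
        aligned = (R @ (est_xyz - mu_e).T).T + mu_g
        return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))

Because the premise is about metric scale, the alignment deliberately excludes a scale factor: a scale-corrected fit that hides a shrunken or stretched trajectory would defeat the check.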

What would settle it

A controlled comparison on the same set of tasks between policies trained on UMI-3D demonstrations and policies trained on the original vision-only UMI data: if the UMI-3D policies achieve no higher success rates, the claimed data-quality advantage does not translate into policy performance.
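In statistical terms, that comparison reduces to testing whether the UMI-3D success proportion exceeds the UMI baseline on matched tasks and trial counts. A minimal sketch of a two-proportion z-test; the trial numbers in the usage line are hypothetical placeholders, not results from the paper:

    # Minimal sketch: one-sided two-proportion z-test on success counts.
    from math import erf, sqrt

    def two_proportion_z(success_a, n_a, success_b, n_b):
        """Test whether condition A's success rate exceeds condition B's."""
        p_a, p_b = success_a / n_a, success_b / n_b
        p_pool = (success_a + success_b) / (n_a + n_b)
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # P(Z > z) under H0
        return z, p_value

    # Hypothetical example: 52/60 successes (UMI-3D data) vs 38/60 (UMI data).
    z, p = two_proportion_z(52, 60, 38, 60)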

Figures

Figures reproduced from arXiv: 2604.14089 by Ziming Wang.

Figure 1: Overview of the UMI-3D system. From left to right, the pipeline consists of three stages: (1) …

Figure 2: UMI-3D Demonstration Interface Design. The system adopts a wrist-mounted sensing design that ensures consistent observation between human demonstrations (left) and robot execution (right). A wide-FoV fisheye camera provides a shared observation space across embodiments, while continuous gripper tracking enables precise action recording and control. This design establishes a unified perception–action interf…

Figure 4: Fisheye camera intrinsic calibration. (A) Raw fisheye image capturing a planar checkerboard calibration target under a wide field of view. (B) Undistorted image using the estimated intrinsic parameters under the equidistant projection model. This calibration enables precise pixel-to-ray mapping, which is essential for subsequent perception tasks including marker detection, spatial alignment, and multimodal…

Figure 5: LiDAR–camera extrinsic calibration. (A) Calibration setup in a typical home environment, demonstrating the practicality and ease of deployment only with a calibration board. (B) Design specification of the calibration target, including geometric layout and fiducial markers. (C) Multimodal observations: (i) fisheye image with detected markers and calibration features; (ii) corresponding LiDAR point cloud. …

Figure 6: Overview of the UMI-3D LiDAR–inertial odometry system. The system follows an iterated error-state Kalman filtering (ESIKF) framework on differentiable manifolds. IMU measurements drive high-frequency forward propagation, while LiDAR scans are processed through scan recombination and residual computation against a voxelized map representation. The state is iteratively refined through backward propagation an…

Figure 7: Unified coordinate system and frame transformations in UMI-3D. The global frame G is initialized as the first IMU frame, and all states are estimated incrementally with respect to this reference. The LiDAR trajectory is directly estimated through LiDAR–inertial odometry, while the camera trajectory is obtained by transforming LiDAR poses using the calibrated extrinsic LTC. The transformation between the Li…

Figure 8: Policy interface and relative action representation in UMI-3D. (Left) The policy takes synchronized multimodal observations, including RGB images, relative end-effector (EE) poses, and gripper states, with explicit latency alignment across sensing and actuation streams. A diffusion policy predicts a sequence of future EE poses, which are executed in a receding-horizon manner with temporal ensembling. (Righ…

Figure 9: Evaluation tasks and policy rollouts. We evaluate UMI-3D across four real-world manipulation tasks. Task 1. Cup arrangement, including both in-distribution and unseen scene/object generalization. Task 2. Curtain pulling, involving large deformable objects under varying lighting conditions. Task 3. Door opening and cup arrangement, a long-horizon task requiring interaction with articulated structures follow…

Figure 10: Robust LiDAR-centric SLAM under challenging real-world conditions. Three representative scenarios are shown: (1) a textureless white wall with minimal visual features, (2) rapid curtain pulling under strong illumination changes with large deformable motion, and (3) a long-horizon manipulation task involving articulated structures and dynamic occlusions. In all cases, UMI-3D achieves stable and accurate po…

Figure 11: Cup arrangement performance across object combinations. Left: Quantitative results over 8 × 8 cup–saucer combinations. Each cell corresponds to one object pair evaluated over 10 trials. The number in the top-right corner indicates the percentage of this combination in the 3,500 training demonstrations. The gray value denotes the accumulated score (out of 60), and the red value shows the normalized score. …

Figure 12: Curtain pulling performance under varying conditions. Left: Three representative curtain types used in training and evaluation, along with their occurrence ratios in the 769 demonstration trajectories. For each curtain type, we report the accumulated score (out of 160) and the normalized score across 40 trials. Right: Representative execution sequences for each curtain type, including the initial configu…

Figure 13: Long-horizon manipulation: door opening, cup grasping, and placement. Left: Task decomposition into three spatial regions: outside the door, inside the cabinet, and above the table. The robot sequentially performs door opening, cup retrieval, and placement, with all trials and corresponding scores shown. Please check our website for more comparison videos. Right: Failure propagation analysis using a Sankey…

Figure 14: Cross-embodiment policy transfer from UMI to UMI-3D. Left: Illustration of the training–deployment pipeline. Policies are trained using the original UMI system and directly deployed on UMI-3D hardware without finetuning. Middle: Quantitative results across 4 × 4 mouse–pad combinations in unseen environments. Each cell reports the accumulated score (out of 30) and normalized score over 5 trials. Right: Rep…
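Figure 8's execution scheme, a diffusion policy that predicts a chunk of future end-effector poses and runs receding-horizon with temporal ensembling, can be sketched compactly. The exponential weighting below follows the common ACT-style convention and is an assumption; the paper may combine overlapping predictions differently:

    # Minimal sketch: temporal ensembling of overlapping action chunks.
    # chunks[s] is the (horizon, action_dim) prediction made at control step s;
    # at least one stored chunk is assumed to cover the current step t.
    import numpy as np

    def temporal_ensemble(chunks, t, horizon, k=0.01):
        preds, weights = [], []
        for s, chunk in chunks.items():
            offset = t - s                       # how old this prediction is
            if 0 <= offset < horizon:
                preds.append(chunk[offset])
                weights.append(np.exp(-k * offset))
        w = np.array(weights) / np.sum(weights)
        return (np.stack(preds) * w[:, None]).sum(axis=0)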
Original abstract

We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) that integrates a lightweight LiDAR sensor into the wrist-mounted data collection device. This enables LiDAR-centric SLAM for metric-scale 6-DoF pose estimation that is more robust to occlusions and dynamic scenes than the original monocular visual SLAM. The authors describe a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework to align visual observations with LiDAR point clouds. Despite retaining the original 2D visuomotor policy formulation, the paper claims that the resulting higher-quality and more reliable demonstration data directly improves policy performance and enables learning of previously challenging tasks such as large deformable object manipulation and articulated object operation. Extensive real-world experiments are reported to demonstrate high success rates on standard tasks along with new capabilities, and all hardware and software components are open-sourced.

Significance. If the empirical claims hold, the work offers a practical, portable, and low-cost advance for scalable data collection in embodied manipulation, addressing key limitations of vision-only systems in real-world conditions. The decision to preserve the original policy architecture while upgrading only the sensing pipeline is a pragmatic strength that lowers barriers to adoption. The open-sourcing of the full pipeline is a clear positive that supports reproducibility and community-scale data efforts. The significance is tempered by the need for rigorous validation that the added perception is the causal driver of the reported gains.

major comments (1)
  1. [Section 5] Section 5 (Experiments) and the abstract: The central claim that the LiDAR integration and spatiotemporal calibration produce demonstrably higher-quality 6-DoF trajectories (leading to policy gains) is not supported by independent ground-truth validation. No quantitative metrics (e.g., ATE, RPE, or failure rates) are reported comparing the estimated poses against an external reference such as motion capture on real manipulation sequences that include hand occlusions, fast wrist rotations, and deformable-object contact. This is load-bearing for the contribution because, without such evidence, observed policy improvements cannot be confidently attributed to the 3D perception rather than confounding factors such as operator behavior or post-processing.
minor comments (2)
  1. [Section 4] Figure 2 and Section 4: The description of the unified spatiotemporal calibration could include a brief quantitative assessment of residual alignment error after calibration to help readers assess consistency across modalities.
  2. [Abstract] Abstract: The statement that improved data quality 'directly translates into enhanced policy performance' would benefit from a forward reference to the specific success-rate tables or ablations that support this causal link.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for the thorough review and the recognition of the practical contributions of UMI-3D. Below we provide a point-by-point response to the major comment and indicate the planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Section 5] Section 5 (Experiments) and the abstract: The central claim that the LiDAR integration and spatiotemporal calibration produce demonstrably higher-quality 6-DoF trajectories (leading to policy gains) is not supported by independent ground-truth validation. No quantitative metrics (e.g., ATE, RPE, or failure rates) are reported comparing the estimated poses against an external reference such as motion capture on real manipulation sequences that include hand occlusions, fast wrist rotations, and deformable-object contact. This is load-bearing for the contribution because, without such evidence, observed policy improvements cannot be confidently attributed to the 3D perception rather than confounding factors such as operator behavior or post-processing.

    Authors: We thank the referee for emphasizing the need for rigorous validation of the pose estimation quality. We agree that independent ground-truth metrics such as ATE or RPE against motion capture would provide stronger causal evidence. However, setting up motion capture for the full spectrum of dynamic, occluded, and contact-rich manipulation sequences is practically challenging in our real-world setup. Instead, we evaluated the system through end-to-end policy success rates and the enabling of previously infeasible tasks, using consistent data collection protocols across comparisons. In the revised manuscript, we will add a dedicated subsection in Section 5 with internal quantitative metrics on SLAM robustness (e.g., tracking failure rates, trajectory consistency, and point-cloud alignment quality) and qualitative trajectory visualizations contrasting UMI and UMI-3D on the same sequences. This will better support attribution of the observed gains to the multimodal perception while preserving the pragmatic focus on policy performance. revision: yes
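For concreteness, one of the internal metrics the rebuttal proposes, point-cloud alignment quality, could be reported as the mean nearest-neighbor residual between an odometry-transformed scan and the accumulated map. A minimal sketch under that assumption; the brute-force pairwise distance would be replaced by a KD-tree on full-resolution data:

    # Minimal sketch: mean nearest-neighbor residual as an internal
    # alignment-quality metric (subsampled clouds assumed for brute force).
    import numpy as np

    def mean_nn_residual(scan_world, map_points):
        """scan_world: (N, 3) scan points in the map frame after applying the
        estimated pose. map_points: (M, 3) reference map points. Returns the
        mean nearest-neighbor distance in meters."""
        d2 = ((scan_world[:, None, :] - map_points[None, :, :]) ** 2).sum(axis=-1)
        return float(np.sqrt(d2.min(axis=1)).mean())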

Circularity Check

0 steps flagged

No circularity: hardware extension with experimental validation only

Full rationale

The paper presents a hardware and sensing pipeline extension to the prior UMI system, adding a LiDAR sensor and spatiotemporal calibration for metric-scale pose estimation. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described content. Central claims rest on real-world experimental success rates for manipulation tasks rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. The contribution is self-contained as an empirical engineering improvement evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering systems paper; the abstract introduces no explicit free parameters, mathematical axioms, or new postulated entities. The work builds on standard SLAM techniques and the existing UMI framework without adding ungrounded constructs.

pith-pipeline@v0.9.0 · 5580 in / 1235 out tokens · 54045 ms · 2026-05-10T12:52:48.855333+00:00 · methodology

discussion (0)

