arxiv: 2606.10382 · v1 · pith:3SOLFKGCnew · submitted 2026-06-09 · 💻 cs.RO

UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

Pith reviewed 2026-06-27 12:57 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · real-world benchmark · UMI · tabletop tasks · policy evaluation · reproducible robotics · wrist-view observation

The pith

UMI-Bench 1.0 introduces the first real-world benchmark built specifically for evaluating UMI-style robotic manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new benchmark called UMI-Bench 1.0 for testing manipulation policies that use the Universal Manipulation Interface on real robots. Existing benchmarks do not align the full pipeline from data collection through physical deployment in the way UMI policies require. By creating a single reproducible protocol that covers data collection, scene reset, policy execution, logging, and factor analysis, the work aims to make it possible to measure how well these policies generalize beyond their training demonstrations. A sympathetic reader would care because reliable real-robot performance is the missing link between learned policies and practical use, and without a shared testbed it has been hard to compare progress.

Core claim

UMI-Bench 1.0 is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models; it aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol to provide a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

What carries the argument

The unified protocol that couples wrist-view observations, action representation, data collection, and physical deployment into one auditable evaluation process.
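To make "one auditable evaluation process" concrete, the following is a minimal sketch of what a per-episode log record might look like. The paper does not specify a schema, so every field name here (`task`, `policy`, `factors`, `reset_ok`, `success`) and the JSON Lines format are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EpisodeRecord:
    # All field names are hypothetical; the paper does not publish a schema.
    task: str       # task identifier
    policy: str     # policy under evaluation
    factors: dict   # task factors, e.g. {"object": "mug"}
    reset_ok: bool  # whether the scene reset passed its checks
    success: bool   # outcome of policy execution


def to_jsonl(records):
    """Serialize records as one JSON object per line so a run can be audited later."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


records = [
    EpisodeRecord("pick_cup", "policy_a", {"object": "mug"}, True, True),
    EpisodeRecord("pick_cup", "policy_a", {"object": "bowl"}, True, False),
]
log_text = to_jsonl(records)
print(log_text)
```

Keeping the reset status and the outcome in the same record is what would let a failure be attributed to a stage of the pipeline rather than to the policy alone.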

If this is right

  • Researchers can now run the same UMI policy through a fixed sequence of tasks and obtain comparable numbers across different labs.
  • Task-factor analysis becomes possible because the protocol records scene resets and outcome logging in a consistent format.
  • Policy developers receive a direct way to test whether changes in data collection or action representation improve real-world success.
  • The benchmark makes the full evaluation process auditable, so failures can be traced to specific stages rather than treated as black-box outcomes.
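As a rough illustration of the task-factor analysis mentioned above, the sketch below groups logged episodes by one factor and reports a success rate per factor value. The episode layout (a `factors` dict plus a `success` flag) and the function name are assumptions, not something the paper defines.

```python
from collections import defaultdict


def success_rate_by_factor(episodes, factor):
    """Return {factor value: success rate} for one named factor."""
    counts = defaultdict(lambda: [0, 0])  # value -> [successes, trials]
    for ep in episodes:
        value = ep["factors"][factor]
        counts[value][0] += int(ep["success"])
        counts[value][1] += 1
    return {value: s / n for value, (s, n) in counts.items()}


# Hypothetical episodes; no data from the paper is used here.
episodes = [
    {"factors": {"object": "mug"}, "success": True},
    {"factors": {"object": "mug"}, "success": False},
    {"factors": {"object": "bowl"}, "success": True},
]
rates = success_rate_by_factor(episodes, "object")
print(rates)
```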

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams that adopt the benchmark may discover that certain wrist-camera mounting choices affect generalization more than the learning algorithm itself.
  • If the protocol is extended to new tasks, it could reveal whether UMI policies transfer across object categories without retraining.
  • The local-first design may encourage smaller labs to contribute results, creating a broader distribution of tested environments than centralized benchmarks allow.

Load-bearing premise

Existing real-world benchmarks cannot be adapted to the UMI data-to-deployment setting, so a new unified protocol is required to achieve standardized and reproducible measurement.

What would settle it

An independent replication in which two separate teams run the identical UMI-Bench tasks and protocol on the same robot hardware but obtain success rates that differ by more than 15 percentage points on the same policy.
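The 15-percentage-point criterion can be checked mechanically once both teams report successes and trial counts. The sketch below computes the gap and, as an added assumption not taken from the paper, a pooled two-proportion z statistic to indicate how much of the gap could be sampling noise. The trial counts are invented for illustration.

```python
import math


def gap_in_points(s1, n1, s2, n2):
    """Absolute difference between two success rates, in percentage points."""
    return abs(s1 / n1 - s2 / n2) * 100.0


def two_proportion_z(s1, n1, s2, n2):
    """Pooled two-proportion z statistic for the difference in success rates."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0


# Hypothetical counts: team A succeeds 40 of 50 times, team B 30 of 50.
gap = gap_in_points(40, 50, 30, 50)
z = two_proportion_z(40, 50, 30, 50)
exceeds_threshold = gap > 15.0
print(round(gap, 1), round(z, 2), exceeds_threshold)
```

With few trials per team, a gap above 15 points can still fall within ordinary sampling variation, so the raw threshold is best read alongside the trial counts.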

Original abstract

Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UMI-Bench 1.0 as the first benchmark dedicated to real-world evaluation of UMI-based manipulation policies. It introduces a unified protocol that aligns data collection, scene reset, policy execution, result logging, and task-factor analysis to enable standardized, reproducible, and auditable measurement of how UMI-trained policies generalize to physical tabletop manipulation tasks.

Significance. If the protocol is implemented with sufficient controls and validation to deliver on reproducibility claims, the benchmark would address a genuine gap: existing real-world manipulation benchmarks are not tailored to the UMI observation-action-data-deployment loop. A working UMI-specific testbed could improve comparability across policies and accelerate progress on generalization in physical settings.

major comments (2)
  1. [Abstract] The central claim that the unified protocol produces 'reproducible and auditable' results rests on the assertion that aligning scene reset with the other components isolates policy performance. No description, procedure, or validation of the reset mechanism is provided, leaving open whether millimeter-scale placement errors and friction variations, which are known to dominate generalization measurements in tabletop settings, are controlled.
  2. [Abstract] The manuscript states that UMI-Bench 'provides a practical testbed' but supplies neither implementation details of the protocol components nor any empirical results (e.g., inter-run variance, success-rate stability across resets, or comparison against prior benchmarks). Without such evidence, the claim that the protocol achieves standardized evaluation cannot be assessed.
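The inter-run variance the referee asks for is inexpensive to report. A minimal sketch, assuming each run is logged as a list of 0/1 episode outcomes (a layout the paper does not specify), computes the per-run success rate and its spread across runs:

```python
import statistics


def run_stability(runs):
    """Return (mean, sample standard deviation) of per-run success rates."""
    rates = [sum(run) / len(run) for run in runs]
    return statistics.mean(rates), statistics.stdev(rates)


# Hypothetical outcomes for three repeated runs of the same policy and task.
runs = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
mean_rate, std_rate = run_stability(runs)
print(mean_rate, std_rate)
```

`statistics.stdev` needs at least two runs; with a single run the spread is undefined and should be reported as such.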
minor comments (1)
  1. [Abstract] The phrase 'local-first' is introduced without definition or contrast to existing benchmarks; a brief clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting these points on the abstract. We agree that additional detail is needed to support the reproducibility claims and will revise the manuscript to address both comments.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the unified protocol produces 'reproducible and auditable' results rests on the assertion that aligning scene reset with the other components isolates policy performance. No description, procedure, or validation of the reset mechanism is provided, leaving open whether millimeter-scale placement errors and friction variations, which are known to dominate generalization measurements in tabletop settings, are controlled.

    Authors: We agree the abstract provides no description of the reset mechanism or its validation. In the revision we will add a concise description of the standardized scene-reset protocol (including fiducial-based placement, friction-control surfaces, and tolerance checks) to the abstract and expand the methods section with the full procedure plus validation data on placement error and friction variation. revision: yes

  2. Referee: [Abstract] The manuscript states that UMI-Bench 'provides a practical testbed' but supplies neither implementation details of the protocol components nor any empirical results (e.g., inter-run variance, success-rate stability across resets, or comparison against prior benchmarks). Without such evidence, the claim that the protocol achieves standardized evaluation cannot be assessed.

    Authors: We acknowledge the absence of implementation details and empirical results in the current version. The revised manuscript will include a dedicated implementation subsection describing all protocol components and will report preliminary empirical results (inter-run variance, success-rate stability across resets) from initial benchmark runs to support the standardized-evaluation claim. revision: yes
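The tolerance checks promised in the first response could take a form like the sketch below: compare measured object positions (for example, from fiducial markers) against target positions and accept the reset only if every error is within tolerance. The object names, coordinates, and the 2 mm tolerance are all hypothetical; neither the paper nor the simulated rebuttal gives actual values.

```python
import math


def reset_within_tolerance(measured_mm, target_mm, tol_mm):
    """Return (ok, errors), where errors maps each object to its placement error in mm."""
    errors = {
        name: math.dist(measured_mm[name], target_mm[name])
        for name in target_mm
    }
    ok = all(err <= tol_mm for err in errors.values())
    return ok, errors


# Hypothetical planar (x, y) positions in millimeters.
target = {"cup": (100.0, 200.0), "plate": (0.0, 0.0)}
measured = {"cup": (103.0, 204.0), "plate": (0.0, 1.0)}

ok, errors = reset_within_tolerance(measured, target, tol_mm=2.0)
print(ok, errors)
```

Logging the per-object errors, not just the pass/fail flag, would supply the placement-error validation data the referee requests.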

Circularity Check

0 steps flagged

No circularity: benchmark proposal without derivations or self-referential fits

Full rationale

The paper proposes UMI-Bench as a unified protocol for real-robot evaluation of UMI-style policies, claiming it is the first such benchmark and that it aligns data collection, scene reset, execution, logging, and analysis. No mathematical derivations, equations, parameter fitting, or predictions appear in the provided text. The central claims rest on the protocol's design and the absence of prior UMI-specific benchmarks, with no load-bearing self-citations, ansatzes, or reductions of outputs to inputs by construction. This is a standard benchmark contribution whose validity can be assessed externally via reproducibility of the protocol itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a benchmark protocol without introducing new fitted parameters, mathematical axioms, or invented physical entities.

pith-pipeline@v0.9.1-grok · 5770 in / 946 out tokens · 20080 ms · 2026-06-27T12:57:36.770061+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation

    cs.RO 2026-06 unverdicted novelty 6.0

    ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.

Reference graph

Works this paper leans on

22 extracted references · 15 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Roboarena: Distributed real-world evaluation of generalist robot policies

    Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. arXiv preprint arXiv:2506.18123, 2025

  2. [2]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  3. [3]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, ...

  4. [4]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  5. [5]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  6. [6]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems, 2023

  7. [7]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  8. [8]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  9. [9]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  10. [10]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  11. [11]

    Fastumi-100k: Advancing data-driven robotic manipulation with a large-scale umi-style dataset

    Kehui Liu, Zhongjie Jia, Yang Li, Zhaxizhuoma, Pengan Chen, Song Liu, Xin Liu, Pingrui Zhang, Haoming Song, Xinyi Ye, Nieqing Cao, Zhigang Wang, Jia Zeng, Dong Wang, Yan Ding, Bin Zhao, and Xuelong Li. Fastumi-100k: Advancing data-driven robotic manipulation with a large-scale umi-style dataset. arXiv preprint arXiv:2510.08022, 2025

  12. [12]

    RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024

  13. [13]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  14. [14]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

  15. [15]

    ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

    Yu Sun, Meng Cao, Ping Yang, Rongtao Xu, Yunxiao Yan, Runze Xu, Liang Ma, Roy Gan, Andy Zhai, Qingxuan Chen, Zunnan Xu, Hao Wang, Jincheng Yu, Lucy Liang, Qian Wang, Ivan Laptev, Ian D Reid, and Xiaodan Liang. Maniparena: Comprehensive real-world evaluation of reasoning-oriented generalist robot manipulation. arXiv preprint arXiv:2603.28545, 2026

  16. [16]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai

    Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai. arXiv preprint arXiv:2410.00425, 2024

  17. [17]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Proceedings of Machine Learning Research, volume 229 of Proceedings of Machine Learning Research...

  18. [18]

    Robochallenge: Large-scale real-robot evaluation of embodied policies

    Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, et al. Robochallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950, 2025

  19. [19]

    World Action Models are Zero-shot Policies

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

  20. [20]

    Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks

    Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. arXiv preprint arXiv:2412.18194, 2024

  21. [21]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems, 2023

  22. [22]

    Fastumi: A scalable and hardware-independent universal manipulation interface with dataset

    Zhaxizhuom Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan Chen, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, and Xuelong Li. Fastumi: A scalable and hardware-independent universal manipulation interface with dataset. In Proceedings of Machine Learni...