UMI-Bench 1.0: An Open and Reproducible Real-World Benchmark for Tabletop Robotic Manipulation with UMI Data

Shi Jin, Yuntian Wang, Yuhui Duan, Di Wu, Gaoqi Dong, Xiaohang Liu, Xiaotong Li, Hongfei Jia, Zehao Zhang, Tianyu Wang, Zhongjie Jia, Yuanqi Yao, Chenjia Bai, Zhaxizhuoma, Siao Liu, Nieqing Cao, Jin Wang, Chao Yu, Yan Ding

arxiv: 2606.10382 · v1 · pith:3SOLFKGCnew · submitted 2026-06-09 · 💻 cs.RO

Pith reviewed 2026-06-27 12:57 UTC · model grok-4.3

classification 💻 cs.RO

keywords robotic manipulation · real-world benchmark · UMI · tabletop tasks · policy evaluation · reproducible robotics · wrist-view observation

The pith

UMI-Bench 1.0 introduces the first real-world benchmark built specifically for evaluating UMI-style robotic manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new benchmark called UMI-Bench 1.0 for testing manipulation policies that use the Universal Manipulation Interface on real robots. Existing benchmarks do not align the full pipeline from data collection through physical deployment in the way UMI policies require. By creating a single reproducible protocol that covers data collection, scene reset, policy execution, logging, and factor analysis, the work aims to make it possible to measure how well these policies generalize beyond their training demonstrations. A sympathetic reader would care because reliable real-robot performance is the missing link between learned policies and practical use, and without a shared testbed it has been hard to compare progress.

Core claim

UMI-Bench 1.0 is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models; it aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol to provide a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

What carries the argument

The unified protocol that couples wrist-view observations, action representation, data collection, and physical deployment into one auditable evaluation process.

If this is right

Researchers can now run the same UMI policy through a fixed sequence of tasks and obtain comparable numbers across different labs.
Task-factor analysis becomes possible because the protocol records scene resets and outcome logging in a consistent format.
Policy developers receive a direct way to test whether changes in data collection or action representation improve real-world success.
The benchmark makes the full evaluation process auditable, so failures can be traced to specific stages rather than treated as black-box outcomes.
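The consequences above depend on every trial being logged in one consistent format. The following is a minimal sketch of what such a record and a per-factor summary could look like. The paper's actual log schema is not given here, so the `TrialRecord` fields, the task and policy names, and the choice to exclude trials with a failed reset are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, asdict


@dataclass
class TrialRecord:
    # All field names are hypothetical, not the benchmark's real schema.
    task: str
    policy: str
    factor: str      # the task factor varied in this trial, e.g. "lighting"
    reset_ok: bool   # whether the scene reset passed its check before execution
    success: bool


def to_jsonl(records):
    """Serialize records as one JSON object per line so a run can be audited later."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def success_by_factor(records):
    """Return {factor: success rate}, counting only trials whose reset passed."""
    counts = defaultdict(lambda: [0, 0])  # factor -> [successes, trials]
    for r in records:
        if not r.reset_ok:
            continue  # assumption: a failed reset invalidates the trial
        counts[r.factor][0] += int(r.success)
        counts[r.factor][1] += 1
    return {f: s / n for f, (s, n) in counts.items()}


# Made-up example data, not results from the paper.
records = [
    TrialRecord("pick_cup", "policy_a", "baseline", True, True),
    TrialRecord("pick_cup", "policy_a", "baseline", True, True),
    TrialRecord("pick_cup", "policy_a", "lighting", True, False),
    TrialRecord("pick_cup", "policy_a", "lighting", True, True),
    TrialRecord("pick_cup", "policy_a", "lighting", False, False),
]
rates = success_by_factor(records)
```

Keeping the reset outcome in the same record as the policy outcome is what would let a failure be attributed to a specific stage rather than to the policy by default.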

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams that adopt the benchmark may discover that certain wrist-camera mounting choices affect generalization more than the learning algorithm itself.
If the protocol is extended to new tasks, it could reveal whether UMI policies transfer across object categories without retraining.
The local-first design may encourage smaller labs to contribute results, creating a broader distribution of tested environments than centralized benchmarks allow.

Load-bearing premise

Existing real-world benchmarks cannot be adapted to the UMI data-to-deployment setting, so a new unified protocol is required to achieve standardized and reproducible measurement.

What would settle it

An independent replication in which two separate teams run the identical UMI-Bench tasks and protocol on the same robot hardware but obtain success rates that differ by more than 15 percentage points on the same policy.
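The replication test above reduces to comparing two success rates. A minimal sketch of that comparison follows, assuming each lab reports a success count and a trial count for the same policy. The counts are made up, and the Wilson score interval is one common choice for binomial proportions rather than something the paper prescribes.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def gap_in_points(succ_a, n_a, succ_b, n_b):
    """Absolute difference between two success rates, in percentage points."""
    return abs(succ_a / n_a - succ_b / n_b) * 100


# Hypothetical counts for two labs running the same policy and tasks.
lab_a = (42, 50)  # 84% success
lab_b = (33, 50)  # 66% success

gap = gap_in_points(*lab_a, *lab_b)
exceeds_threshold = gap > 15.0  # the 15-point criterion stated above

interval_a = wilson_interval(*lab_a)
interval_b = wilson_interval(*lab_b)
```

In this example the gap exceeds 15 points, yet the two intervals still overlap at 50 trials per lab. A threshold on the raw gap is therefore only informative when the trial count is reported alongside it.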

Original abstract

Real-robot evaluation is essential for understanding whether learned manipulation policies can operate reliably outside curated demonstrations. This need is particularly pressing for Universal Manipulation Interface (UMI)-style policies, whose performance depends on the coupling between wrist-view observations, action representation, data collection, and physical deployment. Existing real-world benchmarks have made important progress, but they are not designed around this UMI data-to-deployment setting. We present UMI-Bench 1.0, a local-first real-robot benchmark for standardized evaluation of UMI-style manipulation policies. To the best of our knowledge, this is the first benchmark dedicated to real-world evaluation of UMI-based manipulation models. UMI-Bench aligns data collection, scene reset, policy execution, result logging, and task-factor analysis within a unified protocol. By making the full evaluation process reproducible and auditable, UMI-Bench provides a practical testbed for measuring how UMI-trained policies generalize to real physical manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UMI-Bench offers a first dedicated protocol for UMI policy evaluation, but physical reset variability undercuts the reproducibility claim.

This paper introduces UMI-Bench 1.0 as the first benchmark built specifically around the UMI data-to-deployment pipeline for tabletop manipulation. It unifies data collection, scene reset, policy execution, logging, and factor analysis into one protocol.

The work does a clear job naming the gap: existing real-world benchmarks were not designed with UMI's wrist-view coupling and action format in mind. That observation is fair and points to a practical need in the subfield.

The soft spot is the reproducibility argument. The central claim is that aligning those steps produces auditable, comparable results. Yet scene reset in physical tabletop settings is sensitive to millimeter-scale placement and friction differences that a protocol alone does not remove. The abstract and motivation present no data showing that the reset procedure keeps reset noise from dominating the measured generalization gaps. If the full paper includes controlled reset trials or variance numbers, that would strengthen it; without them, the claim rests on the protocol description alone.

The paper is for researchers already working with UMI-style policies who need a shared evaluation setup. A reader in that niche can extract the protocol outline and decide whether to adopt pieces of it.

Send it to peer review. A benchmark proposal that targets a real mismatch in current practice is worth referee time, even if the reset section needs concrete validation before the reproducibility language can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes UMI-Bench 1.0 as the first benchmark dedicated to real-world evaluation of UMI-based manipulation policies. It introduces a unified protocol that aligns data collection, scene reset, policy execution, result logging, and task-factor analysis to enable standardized, reproducible, and auditable measurement of how UMI-trained policies generalize to physical tabletop manipulation tasks.

Significance. If the protocol is implemented with sufficient controls and validation to deliver on reproducibility claims, the benchmark would address a genuine gap: existing real-world manipulation benchmarks are not tailored to the UMI observation-action-data-deployment loop. A working UMI-specific testbed could improve comparability across policies and accelerate progress on generalization in physical settings.

major comments (2)

[Abstract] The central claim that the unified protocol produces 'reproducible and auditable' results rests on the assertion that aligning scene reset with the other components isolates policy performance. No description, procedure, or validation of the reset mechanism is provided, leaving open whether millimeter-scale placement errors and friction variations, which are known to dominate generalization measurements in tabletop settings, are controlled.
[Abstract] The manuscript states that UMI-Bench 'provides a practical testbed' but supplies neither implementation details of the protocol components nor any empirical results (e.g., inter-run variance, success-rate stability across resets, or comparison against prior benchmarks). Without such evidence the claim that the protocol achieves standardized evaluation cannot be evaluated.
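The inter-run variance the referee asks for can be computed from per-trial outcomes alone. A minimal sketch follows, assuming each run is a list of success flags and that all runs have the same trial count. The data are made up, and comparing the observed spread with the binomial sampling spread is one simple diagnostic, not a method taken from the manuscript.

```python
import math
import statistics


def run_rates(runs):
    """Success rate of each run; a run is a list of booleans, one per trial."""
    return [sum(run) / len(run) for run in runs]


def stability_summary(runs):
    """Compare observed run-to-run spread with the spread expected from
    binomial sampling alone (assumes every run has the same trial count)."""
    n = len(runs[0])
    if n == 0 or any(len(run) != n for run in runs):
        raise ValueError("all runs must be non-empty with the same trial count")
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    rates = run_rates(runs)
    mean = statistics.mean(rates)
    observed_sd = statistics.stdev(rates)            # sample standard deviation
    binomial_sd = math.sqrt(mean * (1 - mean) / n)   # pure sampling noise
    return {"mean": mean, "observed_sd": observed_sd, "binomial_sd": binomial_sd}


# Hypothetical data: three runs of ten trials with 8, 6, and 7 successes.
runs = [
    [True] * 8 + [False] * 2,
    [True] * 6 + [False] * 4,
    [True] * 7 + [False] * 3,
]
summary = stability_summary(runs)
```

An observed spread well above the binomial spread would suggest that something other than sampling noise, such as reset variability, is moving the success rate between runs. With only a handful of runs the estimate is itself noisy, so it should be read as a rough indication.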

minor comments (1)

[Abstract] The phrase 'local-first' is introduced without definition or contrast to existing benchmarks; a brief clarification would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting these points on the abstract. We agree that additional detail is needed to support the reproducibility claims and will revise the manuscript to address both comments.

Referee: [Abstract] The central claim that the unified protocol produces 'reproducible and auditable' results rests on the assertion that aligning scene reset with the other components isolates policy performance. No description, procedure, or validation of the reset mechanism is provided, leaving open whether millimeter-scale placement errors and friction variations, which are known to dominate generalization measurements in tabletop settings, are controlled.

Authors: We agree the abstract provides no description of the reset mechanism or its validation. In the revision we will add a concise description of the standardized scene-reset protocol (including fiducial-based placement, friction-control surfaces, and tolerance checks) to the abstract and expand the methods section with the full procedure plus validation data on placement error and friction variation.
revision: yes

Referee: [Abstract] The manuscript states that UMI-Bench 'provides a practical testbed' but supplies neither implementation details of the protocol components nor any empirical results (e.g., inter-run variance, success-rate stability across resets, or comparison against prior benchmarks). Without such evidence the claim that the protocol achieves standardized evaluation cannot be evaluated.

Authors: We acknowledge the absence of implementation details and empirical results in the current version. The revised manuscript will include a dedicated implementation subsection describing all protocol components and will report preliminary empirical results (inter-run variance, success-rate stability across resets) from initial benchmark runs to support the standardized-evaluation claim.
revision: yes
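The tolerance check mentioned in the first response could take a form like the sketch below. It assumes the object's planar pose has already been estimated, for example from a fiducial marker. The `Pose2D` type, the 2 mm and 3 degree tolerances, and the example poses are hypothetical and do not come from the manuscript.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose2D:
    # Planar pose of an object on the table; units are assumptions.
    x_mm: float
    y_mm: float
    yaw_deg: float


def angle_diff_deg(a, b):
    """Smallest signed difference a - b, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def reset_within_tolerance(measured, target, pos_tol_mm=2.0, yaw_tol_deg=3.0):
    """Return (ok, position_error_mm, yaw_error_deg) for one reset.

    The default tolerances are illustrative placeholders only.
    """
    pos_err = math.hypot(measured.x_mm - target.x_mm, measured.y_mm - target.y_mm)
    yaw_err = abs(angle_diff_deg(measured.yaw_deg, target.yaw_deg))
    ok = pos_err <= pos_tol_mm and yaw_err <= yaw_tol_deg
    return ok, pos_err, yaw_err


# Hypothetical reset: about 1.5 mm off target and rotated by 1.5 degrees.
target = Pose2D(100.0, 200.0, 0.0)
measured = Pose2D(101.2, 199.1, 358.5)
ok, pos_err, yaw_err = reset_within_tolerance(measured, target)
```

Logging the two error values for every reset, not just the pass or fail flag, would supply the placement-error validation data the authors promise.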

Circularity Check

0 steps flagged

No circularity: benchmark proposal without derivations or self-referential fits

The paper proposes UMI-Bench as a unified protocol for real-robot evaluation of UMI-style policies, claiming it is the first such benchmark and that it aligns data collection, scene reset, execution, logging, and analysis. No mathematical derivations, equations, parameter fitting, or predictions appear in the provided text. The central claims rest on the protocol's design and the absence of prior UMI-specific benchmarks, with no load-bearing self-citations, ansatzes, or reductions of outputs to inputs by construction. This is a standard benchmark contribution whose validity can be assessed externally via reproducibility of the protocol itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a benchmark protocol without introducing new fitted parameters, mathematical axioms, or invented physical entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ARP: Enhancing Quantized Skill Abstractions via Visual Alignment and Iterative Refinement for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

ARP enhances quantized skill abstractions in imitation learning by coupling visual grounding via contrastive alignment with execution refinement via IRH, reporting SOTA results on LIBERO, Meta-World, and real-robot tasks.

Reference graph

Works this paper leans on

22 extracted references · 15 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1] Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. arXiv preprint arXiv:2506.18123, 2025.

[2] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[3] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, ... 2025.

[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

[5] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.

[6] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems, 2023.

[7] Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024.

[8] Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.

[9] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

[10] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

[11] Kehui Liu, Zhongjie Jia, Yang Li, Zhaxizhuoma, Pengan Chen, Song Liu, Xin Liu, Pingrui Zhang, Haoming Song, Xinyi Ye, Nieqing Cao, Zhigang Wang, Jia Zeng, Dong Wang, Yan Ding, Bin Zhao, and Xuelong Li. Fastumi-100k: Advancing data-driven robotic manipulation with a large-scale umi-style dataset. arXiv preprint arXiv:2510.08022, 2025.

[12] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.

[13] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.

[14] Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023.

[15] Yu Sun, Meng Cao, Ping Yang, Rongtao Xu, Yunxiao Yan, Runze Xu, Liang Ma, Roy Gan, Andy Zhai, Qingxuan Chen, Zunnan Xu, Hao Wang, Jincheng Yu, Lucy Liang, Qian Wang, Ivan Laptev, Ian D Reid, and Xiaodan Liang. ManipArena: Comprehensive real-world evaluation of reasoning-oriented generalist robot manipulation. arXiv preprint arXiv:2603.28545, 2026.

[16] Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse-kai Chan, et al. Maniskill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI. arXiv preprint arXiv:2410.00425, 2024.

[17] Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Proceedings of Machine Learning Research, volume 229 of Proceedings of Machine Learning Research... 2023.

[18] Adina Yakefu, Bin Xie, Chongyang Xu, Enwen Zhang, Erjin Zhou, Fan Jia, Haitao Yang, Haoqiang Fan, Haowei Zhang, Hongyang Peng, et al. Robochallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950, 2025.

[19] Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ... World Action Models are Zero-shot Policies. 2026.

[20] Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. arXiv preprint arXiv:2412.18194, 2024.

[21] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Proceedings of Robotics: Science and Systems, 2023.

[22] Zhaxizhuom Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan Chen, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, and Xuelong Li. Fastumi: A scalable and hardware-independent universal manipulation interface with dataset. In Proceedings of Machine Learni... 2025.
