Labimus: A Simulation and Benchmark for Humanoid Dexterous Manipulation in Chemical Laboratory
Pith reviewed 2026-07-01 05:51 UTC · model grok-4.3
The pith
The Labimus benchmark shows that robot policies can complete lab tasks yet still fail to meet the required experimental precision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Labimus exposes a disconnect between task completion and experimental validity: policies that finish laboratory operations can still violate the precision tolerances demanded by real chemistry protocols, even under procedural layouts and perturbations.
What carries the argument
The Labimus benchmark, built from real-to-sim modeled lab assets, particle-based powder physics, and closed-loop instrument readouts that enable joint assessment of manipulation success and measurement validity.
If this is right
- Evaluation of lab robots must include quantitative precision metrics in addition to task success rates.
- Training methods need explicit mechanisms to enforce experimental tolerances during long-horizon sequences.
- The benchmark supplies a standardized testbed for comparing humanoid policies on chemically relevant manipulations.
- Development of reliable lab robots should prioritize closing the gap between task completion and valid experimental outcomes.
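The first implication above can be made concrete with a minimal sketch of a precision-aware score. Everything in it is an illustrative assumption rather than the Labimus protocol itself: the `Episode` record, its field names, the 100 mg target, and the 5 mg tolerance are hypothetical, and the episodes are made up.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One rollout of a weighing task (hypothetical record layout)."""
    completed: bool          # did the policy finish every step of the task?
    measured_mass_mg: float  # balance readout at the end of the episode


def evaluate(episodes, target_mg, tol_mg):
    """Return task success rate, precision-valid rate, and the gap between them.

    An episode counts as precision-valid only if it completed AND its
    readout lies within +/- tol_mg of the target.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    completed = sum(e.completed for e in episodes)
    valid = sum(
        e.completed and abs(e.measured_mass_mg - target_mg) <= tol_mg
        for e in episodes
    )
    success_rate = completed / n
    valid_rate = valid / n
    return success_rate, valid_rate, success_rate - valid_rate


# Made-up episodes for illustration only; not results from the paper.
episodes = [
    Episode(True, 101.0),   # completed, within tolerance
    Episode(True, 112.0),   # completed, but over-dispensed
    Episode(True, 96.0),    # completed, within tolerance
    Episode(False, 40.0),   # failed mid-task
]
success, valid, gap = evaluate(episodes, target_mg=100.0, tol_mg=5.0)
print(f"success={success:.2f} valid={valid:.2f} gap={gap:.2f}")
# prints: success=0.75 valid=0.50 gap=0.25
```

Reporting the gap alongside the raw success rate is one simple way to surface episodes that finish the task but would be rejected by the protocol.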
Where Pith is reading between the lines
- The precision gap may indicate that current imitation or reinforcement learning approaches lack sufficient feedback from measurement outcomes during training.
- Extending the benchmark to liquid handling or multi-step synthesis workflows could test whether the same disconnect appears in other lab domains.
- If real-robot validation confirms the gap, it would motivate hybrid sim-real training loops that incorporate live instrument data.
Load-bearing premise
The simulated assets, powder dynamics, and instrument readouts capture the precision and variability of actual organic chemistry operations closely enough for the observed gap to hold in reality.
What would settle it
Running the same policies on physical lab equipment and finding that the precision failures either disappear or persist at the same rate as in simulation.
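One way such a comparison could be summarized is sketched below, assuming precision failures are simply counted over matched simulated and physical trials. The function name, the trial counts, and the choice of a normal-approximation (Wald) interval are all assumptions made for illustration; the paper describes no physical trials.

```python
import math


def failure_rate_difference(sim_fail, sim_n, real_fail, real_n, z=1.96):
    """Difference in precision-failure rates (real minus sim) with a
    normal-approximation (Wald) confidence interval."""
    if sim_n <= 0 or real_n <= 0:
        raise ValueError("trial counts must be positive")
    p_sim = sim_fail / sim_n
    p_real = real_fail / real_n
    diff = p_real - p_sim
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_sim * (1 - p_sim) / sim_n + p_real * (1 - p_real) / real_n)
    return diff, (diff - z * se, diff + z * se)


# Hypothetical counts, for illustration only.
diff, (lo, hi) = failure_rate_difference(
    sim_fail=40, sim_n=100, real_fail=35, real_n=100
)
covers_zero = lo <= 0.0 <= hi  # True: data are compatible with an unchanged rate
print(f"diff={diff:+.3f} CI=({lo:+.3f}, {hi:+.3f}) covers_zero={covers_zero}")
```

An interval that covers zero is only compatible with the failures persisting at the same rate; it does not prove equivalence, so a real study would need enough trials to make the interval narrow.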
Original abstract
Laboratory automation has made remarkable progress through robotic platforms and AI-driven scientific reasoning. However, many laboratory operations (e.g., solid-solid transfer) remain inherently dynamic and require real-time adaptation to different materials and experimental conditions. Such precision-critical manipulations are difficult to standardize, motivating the use of humanoid robots with dexterous hands. Despite this opportunity, no existing benchmark evaluates humanoid manipulation in precision-critical laboratory environments. We present Labimus, to our knowledge, the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories. Labimus reconstructs over 30 functionally faithful assets from real organic chemistry workstations through real-to-sim modeling, collectively covering the core operations of routine organic chemistry experiments. The benchmark integrates articulated laboratory instruments, particle-based powder physics, and closed-loop instrument readouts, enabling a complete manipulation-to-measurement pipeline. It further defines six atomic operations and a seven-step solid-weighing workflow derived from real laboratory standard operating procedures. We introduce a precision-aware evaluation protocol designed to jointly measure task completion, experimental precision, and long-horizon execution. We benchmark three representative policies under procedural layouts and environmental perturbations. Results reveal a precision gap: policies that successfully complete laboratory tasks can still fail to satisfy the quantitative tolerances required by experimental protocols. Our benchmark exposes a fundamental disconnect between task completion and experimental validity, providing a new testbed for developing reliable humanoid robots for scientific laboratories.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Labimus as the first benchmark for humanoid dexterous manipulation in organic chemistry laboratories. It reconstructs over 30 real-to-sim assets covering core operations, integrates articulated instruments with particle-based powder physics and closed-loop readouts, defines six atomic operations plus a seven-step solid-weighing workflow from real SOPs, and applies a precision-aware evaluation protocol. Benchmarking three policies under procedural and perturbed conditions reveals a precision gap in which task completion does not guarantee satisfaction of quantitative experimental tolerances.
Significance. If the simulation dynamics prove faithful to real laboratory tolerances, the benchmark supplies a needed testbed that shifts evaluation from binary task success to joint measurement of completion, precision, and long-horizon validity. The explicit construction from SOP-derived workflows and the precision-aware protocol constitute concrete strengths that could guide development of reliable lab robots.
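The long-horizon component mentioned here can be illustrated with a short sketch that scores rollouts of a seven-step workflow. The scoring rule (consecutive steps completed before the first failure) and the rollout data are assumptions chosen for illustration, not the metric defined in the paper.

```python
def steps_before_failure(step_results):
    """Count consecutive successful steps from the start of a workflow."""
    count = 0
    for ok in step_results:
        if not ok:
            break
        count += 1
    return count


def long_horizon_summary(rollouts, num_steps=7):
    """Return mean progress (fraction of steps reached) and the
    full-workflow success rate over a set of rollouts."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    if any(len(r) != num_steps for r in rollouts):
        raise ValueError("every rollout must report one result per step")
    reached = [steps_before_failure(r) for r in rollouts]
    mean_progress = sum(reached) / (num_steps * len(rollouts))
    full_success = sum(n == num_steps for n in reached) / len(rollouts)
    return mean_progress, full_success


# Made-up per-step outcomes for three rollouts; not results from the paper.
rollouts = [
    [True] * 7,                                     # all seven steps succeed
    [True, True, True, False, True, True, True],    # first failure at step 4
    [True, True, True, True, True, False, False],   # first failure at step 6
]
mean_progress, full_success = long_horizon_summary(rollouts)
print(f"mean_progress={mean_progress:.3f} full_success={full_success:.3f}")
# prints: mean_progress=0.714 full_success=0.333
```

Counting only the steps before the first failure reflects the idea that, in a sequential protocol, later steps are not meaningful once an earlier one has gone wrong.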
major comments (1)
- [Abstract] Abstract: the central claim that the benchmark 'exposes a fundamental disconnect between task completion and experimental validity' is load-bearing on the fidelity of the particle-based powder physics, articulated instruments, and closed-loop readouts to real organic-chemistry tolerances (e.g., mass-transfer accuracy within protocol limits). No side-by-side quantitative comparison of simulated versus physical outcomes for weighing precision, powder flow, or sensor readouts is described, leaving open the possibility that the reported precision gap reflects simulation artifacts rather than transferable experimental requirements.
minor comments (2)
- The abstract states that 'over 30 functionally faithful assets' were reconstructed but supplies no quantitative metric or verification procedure for functional faithfulness.
- The three representative policies are mentioned without naming or characterizing them, which limits assessment of result generality.
Simulated Author's Rebuttal
We thank the referee for this constructive comment on simulation fidelity, which directly impacts the strength of our central claim. We address it point-by-point below.
Referee: [Abstract] Abstract: the central claim that the benchmark 'exposes a fundamental disconnect between task completion and experimental validity' is load-bearing on the fidelity of the particle-based powder physics, articulated instruments, and closed-loop readouts to real organic-chemistry tolerances (e.g., mass-transfer accuracy within protocol limits). No side-by-side quantitative comparison of simulated versus physical outcomes for weighing precision, powder flow, or sensor readouts is described, leaving open the possibility that the reported precision gap reflects simulation artifacts rather than transferable experimental requirements.
Authors: We agree that the central claim relies on the simulation components being sufficiently faithful to real laboratory tolerances. The manuscript does not include side-by-side quantitative comparisons of simulated versus physical outcomes for weighing precision, powder flow, or sensor readouts; this is a genuine limitation, as the work prioritizes benchmark construction from real-to-sim assets and SOP-derived workflows rather than new physical validation experiments. The particle-based physics, articulated instruments, and closed-loop readouts follow standard simulation practices with parameters chosen to approximate typical organic chemistry conditions, but without explicit calibration data against physical trials. In the revised manuscript we will (1) qualify the abstract claim to specify that the disconnect is shown within the simulated environment and (2) add an explicit limitations subsection discussing modeling assumptions and the need for future sim-to-real studies. These textual changes will be incorporated.
revision: partial
Circularity Check
No circularity in benchmark definition or evaluation protocol
The paper constructs Labimus as a simulation benchmark by reconstructing real laboratory assets via real-to-sim modeling and deriving workflows from standard operating procedures. No equations, fitted parameters, or predictions are defined in a self-referential manner. The precision-aware evaluation protocol jointly measures task completion and experimental validity as independent metrics without reducing one to the definition of the other. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim about a disconnect between task completion and validity follows directly from running external policies on the defined benchmark, without any reduction to the benchmark's own inputs by construction. This is a standard benchmark presentation with fully independent content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Real laboratory standard operating procedures can be faithfully translated into six atomic operations and a seven-step workflow in simulation.
- domain assumption: Particle-based powder physics and closed-loop instrument readouts produce dynamics representative of real organic chemistry manipulations.