ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

Charles Yang; Dongxiu Liu; Hao Wang; Ian Reid; Ivan Laptev; Jincheng Yu; Kaidong Zhang; Liang Ma; Liangwang Ruan; Lufang Chen

arXiv: 2603.28545 · v2 · pith:IL3BS4XGnew · submitted 2026-03-30 · cs.RO · cs.CV

Yu Sun, Meng Cao, Yang Ping, Kaidong Zhang, Qingxuan Chen, Rongtao Xu, Liangwang Ruan, Xuecheng Chen, Dongxiu Liu, Yunxiao Yan, Zunnan Xu, Runze Xu, Charles Yang, Peilun Zhang, Xiaofan Li, Ruyi Gan, Liang Ma, Yuehao Yin, Jincheng Yu, Lufang Chen, Yuxin Liang, Peng Zhai, Hao Wang, Ivan Laptev, Ian Reid, Qian Wang, Xiaodan Liang

keywords: maniparena, evaluation, manipulation, real-robot, across, failure, framework, generalization

Vision-Language-Action (VLA) models and world-action models have emerged as central paradigms for general-purpose robotic intelligence, yet their empirical progress remains constrained by the absence of evaluation protocols that are both physically realistic and diagnostically controlled. Simulator-centric benchmarks provide scale and reproducibility, but cannot fully capture the reality gap induced by perception noise, contact dynamics, latency, calibration error, and hardware constraints. Conversely, real-robot evaluations are often fragmented across platforms, scenes, objects, and scoring rules, making fair comparison and failure attribution difficult. We introduce ManipArena, a standardized real-robot evaluation framework for studying manipulation generalization under matched physical conditions. ManipArena comprises 20 tasks, 10,812 expert trajectories, 13.5M frames, and approximately 188 robot hours across tabletop and mobile manipulation. The framework combines schema-defined task variation, stratified in-domain, visual-shift, and semantic-OOD trials, subtask-level partial-credit scoring, three-level language annotations, low-level motor signals, and paired real-to-sim environments reconstructed from physical scenes. Using ManipArena, we evaluate seven tabletop configurations spanning VLA and world-action-model policies. The results show that real-robot conclusions depend not only on architecture, but also on model provenance, fine-tuning regime, data sampling, and annotation granularity. ManipArena thus provides a reproducible and interpretable foundation for diagnosing capability boundaries and failure modes in embodied generalization.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

JoyAI-Sim: A Simulation-Enabled Interconversion Toolchain for the Embodied Data Pyramid
cs.RO 2026-06 unverdicted novelty 3.0

JoyAI-Sim provides bidirectional Robot-Simulation-Human pathways for aligned model evaluation and data generation in robotics using the JoySim simulator as an evaluation layer and physical consistency filter.