arxiv: 2606.04233 · v1 · pith:PSK6OG2Enew · submitted 2026-06-02 · 💻 cs.RO

What Are We Actually Benchmarking in Robot Manipulation?

Pith reviewed 2026-06-28 09:29 UTC · model grok-4.3

classification 💻 cs.RO
keywords: robot manipulation, benchmarks, evaluation diagnostics, LIBERO, CALVIN, shortcut learning, statistical significance

The pith

Many robot manipulation benchmarks fail diagnostics for shortcut use, insignificant gains, and data dependence, so their scores do not prove general capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a benchmark score obtained under one fixed setup is routinely treated as proof of broad manipulation skill, yet four specific failure modes can make that reading invalid. It defines shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence, then supplies one diagnostic for each. When the diagnostics are run on LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0, LIBERO and CALVIN fail several, while RoboCasa and RoboTwin 2.0 fail fewer. On LIBERO, a 0.09B probe with no language encoder scores at or near reported state-of-the-art numbers, and most reported gains are not provably statistically significant. On CALVIN, randomizing block poses within the training range lowers every tested policy's score. The authors release the diagnostics with reference implementations so that claims can be checked before they are accepted as progress.

Core claim

A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability. Four failure modes weaken or invalidate this role: shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence. One diagnostic is proposed for each mode. LIBERO and CALVIN fail multiple diagnostics; RoboCasa and RoboTwin 2.0 fail fewer. On LIBERO a 0.09B probe without a language encoder scores at or near reported SOTA and most reported gains are not provably statistically significant. On CALVIN, randomizing block poses within the training range drops performance for every tested policy.

What carries the argument

Four failure modes (shortcut solvability, lack of statistical significance, creeping overfitting, data-source dependence) each paired with one diagnostic that checks whether a benchmark score can serve as a valid proxy for general manipulation capability.

If this is right

  • LIBERO and CALVIN scores should not be cited as evidence of general progress unless the specific claim survives the four diagnostics.
  • RoboCasa and RoboTwin 2.0 become preferable evaluation suites for claims that aim to demonstrate general capability.
  • Reported SOTA numbers on LIBERO require statistical testing before they can be treated as real improvements.
  • Policies that appear strong on failing benchmarks may be exploiting fixed data sources or pose distributions rather than learning transferable skills.
  • Authors and reviewers can run the released diagnostics before publishing or accepting a new result.
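
To make the statistical-testing point concrete, here is a minimal sketch of one way a reader might check whether a reported gain in success rate is distinguishable from noise. It is not the paper's released diagnostic: the function name `two_proportion_z_test`, the choice of a pooled two-proportion z-test, and the episode counts in the example are all assumptions made for illustration. The normal approximation is also rough when success rates sit near 100%, so an exact test may be preferable in practice.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided pooled z-test for a difference between two success rates.

    Hypothetical helper for illustration; not taken from the paper's code.
    Returns (z, p_value), where z > 0 means policy B scored higher than A.
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis of no real difference.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0:
        # Both policies succeeded (or failed) on every episode.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Assumed numbers: 96.0% vs 97.0% success, 500 episodes each.
    z, p = two_proportion_z_test(480, 500, 485, 500)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

Under these assumed counts, a one-point gain near the ceiling does not reach the conventional 0.05 level, which is the kind of outcome the bullet on LIBERO SOTA numbers warns about.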

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark designers could add the four diagnostics as mandatory checks in future suite releases.
  • If the diagnostics become standard, papers may shift from reporting single-suite numbers to reporting which suites pass which checks.
  • The same failure modes could be tested on other embodied benchmarks outside manipulation, such as navigation or assembly tasks.
  • A benchmark that passes all four diagnostics would still need separate evidence that its tasks cover the full range of manipulation behaviors a general agent should handle.

Load-bearing premise

The four listed failure modes and their diagnostics are enough to decide whether any given benchmark score really stands for general manipulation skill.

What would settle it

Apply the four diagnostics to a new benchmark in which a language-free 0.09B probe scores far below every reported policy, every claimed gain passes a statistical-significance test, and randomizing object poses inside the training distribution leaves policy performance unchanged.

Figures

Figures reproduced from arXiv: 2606.04233 by Luzhe Sun, Matthew Walter, Samuel Wheeler, Tewodros W. Ayalew, Tianchong Jiang, Xiangshan Tan.

Figure 1: The panels preview the four diagnostics used in our audit.
Figure 2: Cumulative counts of arXiv papers reporting results on CALVIN, LIBERO, SimplerEnv, [...]
Figure 3: Reported gains from public scores alone. Each pie sorts one benchmark’s previous-best [...]
Figure 4: SimplerEnv stack task with three changes that stay inside the training distribution. Left, [...]
Figure 5: Example rendered observations from the five audited benchmarks, shown only to orient [...]
Figure 6: Closed-loop propagation in one LIBERO-Goal task 8 hardware comparison. With policy, [...]
Original abstract

A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability. We identify four failure modes, each of which weakens or invalidates a benchmark's role as a valid proxy for that capability: shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence. We propose one diagnostic per failure mode. We audit LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0 under these diagnostics. LIBERO and CALVIN fail multiple diagnostics. RoboCasa and RoboTwin 2.0 fail fewer, despite appearing far less often in recent progress claims. On LIBERO, a 0.09B probe with no language encoder scores at or near reported SOTA, and most reported gains are not provably statistically significant. On CALVIN, randomizing block poses within the training range drops performance for every tested policy. We release the four diagnostics with reference implementations for authors and reviewers to apply before treating a benchmark score as evidence of progress. Code and artifacts are available at https://ripl.github.io/manipulation_benchmark_audit/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that robot manipulation benchmark scores are routinely treated as evidence of general manipulation capability but are undermined by four failure modes (shortcut solvability, lack of statistical significance, creeping overfitting, and data-source dependence). It proposes one diagnostic per mode, audits LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0, reports that LIBERO and CALVIN fail multiple diagnostics while the others fail fewer, shows that a 0.09B probe without a language encoder reaches near-SOTA on LIBERO with most gains lacking statistical significance, and finds that pose randomization within the training range drops performance on CALVIN for all tested policies. Code and reference implementations are released.

Significance. If the diagnostics are shown to be sufficient, the work could improve evaluation practices in robot learning by discouraging over-interpretation of benchmark scores. The explicit release of code, artifacts, and reference implementations is a clear strength that supports reproducibility and community adoption of the proposed checks.

major comments (3)
  1. [Abstract] Abstract, first paragraph: the statement that each failure mode 'weakens or invalidates' a benchmark's role as a proxy for general manipulation capability treats the four modes as decisive without an argument or evidence that they are exhaustive, that passing all four guarantees proxy validity, or that failure necessarily severs any link to general capability rather than indicating only a narrower limitation.
  2. [Results (LIBERO audit)] Results on LIBERO (probe experiment): for the claim that a 0.09B probe with no language encoder scores at or near reported SOTA to be load-bearing, the manuscript must state the exact evaluation protocol, number of trials, variance estimates, and comparison baselines; without these, the result cannot be verified as contradicting prior SOTA claims.
  3. [Results (CALVIN audit)] CALVIN pose-randomization experiment: the finding that randomizing block poses within the training range drops performance for every tested policy is presented as evidence of a failure mode, but the manuscript must specify the exact randomization procedure, number of seeds, and whether the drop exceeds the variance of the original evaluation to establish it as a diagnostic failure.
minor comments (2)
  1. [Introduction] The term 'creeping overfitting' is introduced without a formal definition or citation to prior usage; a brief clarification of how it differs from standard overfitting would improve readability.
  2. [Audit results] Table or figure reporting the per-benchmark diagnostic outcomes should include the exact numerical thresholds used for each diagnostic (e.g., significance level, probe size) to allow direct replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our diagnostics. We address each major comment below and indicate planned revisions to the manuscript.

  1. Referee: [Abstract] Abstract, first paragraph: the statement that each failure mode 'weakens or invalidates' a benchmark's role as a proxy for general manipulation capability treats the four modes as decisive without an argument or evidence that they are exhaustive, that passing all four guarantees proxy validity, or that failure necessarily severs any link to general capability rather than indicating only a narrower limitation.

    Authors: We agree the abstract phrasing could be read as implying decisiveness. The manuscript does not claim the four modes are exhaustive, nor that passing them guarantees proxy validity; it presents them as identifiable failure modes that can undermine the interpretation of a score as evidence of general capability. We will revise the abstract to state that each mode 'can weaken or invalidate' the proxy role and add a clarifying sentence that other unlisted failure modes may exist. revision: yes

  2. Referee: [Results (LIBERO audit)] Results on LIBERO (probe experiment): for the claim that a 0.09B probe with no language encoder scores at or near reported SOTA to be load-bearing, the manuscript must state the exact evaluation protocol, number of trials, variance estimates, and comparison baselines; without these, the result cannot be verified as contradicting prior SOTA claims.

    Authors: The full evaluation protocol, trial counts (100 episodes per task), variance estimates, and direct numerical comparisons to published SOTA results are provided in Section 4.2 and the supplementary material, with the released code reproducing the exact numbers. To make this self-contained in the main text, we will add an explicit table listing the probe score against the cited SOTA baselines with their reported standard deviations. revision: yes
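
As a rough illustration of what 100 episodes per task can resolve, the sketch below computes a Wilson score interval for a single task's success rate. This is an assumption-laden example rather than the authors' procedure: the helper `wilson_interval` and the count of 95 successes are hypothetical, and only the figure of 100 episodes comes from the response above.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    Hypothetical helper for illustration; z=1.96 gives roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Assumed outcome: 95 successes out of 100 episodes on one task.
    low, high = wilson_interval(95, 100)
    print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

With these assumed numbers the interval spans roughly 0.89 to 0.98, so two policies whose per-task scores differ by a point or two at this sample size would have heavily overlapping intervals.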

  3. Referee: [Results (CALVIN audit)] CALVIN pose-randomization experiment: the finding that randomizing block poses within the training range drops performance for every tested policy is presented as evidence of a failure mode, but the manuscript must specify the exact randomization procedure, number of seeds, and whether the drop exceeds the variance of the original evaluation to establish it as a diagnostic failure.

    Authors: The exact randomization procedure (uniform sampling of block poses within the original training distribution bounds), number of evaluation seeds (5), and statistical comparison (drop exceeds 2 standard deviations of the original evaluation for all policies) are detailed in Section 4.3 and the released reference implementation. We will insert a short paragraph in the main text summarizing these parameters and the variance comparison, so that the diagnostic can be assessed from the main text without the supplement. revision: yes
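
The procedure described in this response can be sketched as follows. This is only an illustration under stated assumptions, not the released reference implementation: the pose bounds, the per-seed scores, and the helper names `sample_block_pose` and `drop_exceeds_k_sigma` are hypothetical, and the use of the sample standard deviation across seeds is one plausible reading of "2 standard deviations of the original evaluation".

```python
import random
import statistics


def sample_block_pose(bounds, rng):
    """Draw one pose uniformly inside per-axis (low, high) bounds.

    `bounds` is a hypothetical stand-in for the training-distribution range.
    """
    return {axis: rng.uniform(low, high) for axis, (low, high) in bounds.items()}


def drop_exceeds_k_sigma(original_scores, randomized_scores, k=2.0):
    """Compare the mean drop against k sample standard deviations.

    Both inputs hold one score per evaluation seed. Returns (drop, flag).
    """
    mean_original = statistics.mean(original_scores)
    sd_original = statistics.stdev(original_scores)
    drop = mean_original - statistics.mean(randomized_scores)
    return drop, drop > k * sd_original


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed bounds in metres and radians; not taken from CALVIN itself.
    bounds = {"x": (-0.10, 0.10), "y": (-0.05, 0.05), "yaw": (-3.14, 3.14)}
    print(sample_block_pose(bounds, rng))

    # Assumed per-seed scores for 5 seeds, before and after randomization.
    original = [0.90, 0.92, 0.91, 0.89, 0.93]
    randomized = [0.70, 0.72, 0.68, 0.71, 0.69]
    drop, flagged = drop_exceeds_k_sigma(original, randomized)
    print(f"drop = {drop:.2f}, exceeds 2 sigma: {flagged}")
```

With only five seeds the standard deviation estimate is itself noisy, so a 2-sigma threshold is better read as a screening rule than as a formal significance test.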

Circularity Check

0 steps flagged

No significant circularity; diagnostics proposed independently

The paper identifies four failure modes and proposes one diagnostic per mode, then applies them empirically to existing benchmarks. No equations, fitted parameters, or predictions reduce by construction to inputs defined inside the paper. The central claims rest on direct audits (e.g., 0.09B probe scores, pose randomization drops) rather than self-definition, self-citation chains, or renamed known results. The provided text contains no load-bearing self-citations or ansatzes that justify the failure-mode framework itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that benchmark scores are routinely treated as proxies for general capability and that the listed failure modes invalidate that proxy role; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: A robotics benchmark score measures success under one fixed evaluation setup, yet is routinely treated as evidence of general manipulation capability.
    Opening sentence of the abstract; this premise motivates the entire audit.

