pith. machine review for the scientific record.

arxiv: 2605.02881 · v2 · submitted 2026-05-04 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

MolmoAct2: Action Reasoning Models for Real-world Deployment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:56 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision language action · robotics · embodied AI · open source · action reasoning · bimanual · flow matching

The pith

MolmoAct2 is a fully open action reasoning model that outperforms prior open and closed systems on robot control and embodied reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops MolmoAct2 to overcome the barriers that keep current VLA models out of real-world robot use: closed access, expensive hardware, high latency, and low success rates. It does so by introducing a specialized embodied-reasoning backbone, extensive new datasets including a large bimanual collection, an open action tokenizer, a hybrid architecture for continuous actions, and an efficient adaptive reasoning method. If these advances hold, they would enable effective robot deployment with accessible open models on standard hardware. Extensive testing across multiple benchmarks supports the claims of superior performance.

Core claim

MolmoAct2 advances its predecessor along five axes: MolmoER, a VLM backbone specialized for spatial and embodied reasoning on a 3.3M-sample corpus using a specialize-then-rehearse recipe; three new datasets, including the largest open collection of bimanual trajectories; OpenFAST, an open-weight action tokenizer; an architecture redesign that grafts a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning; and MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change. On this basis, MolmoAct2 outperforms baselines such as Pi-05 on 7 benchmarks, MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 on 13 embodied-reasoning benchmarks, and the model weights, training code, and training data are fully released.
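For orientation, the continuous-action expert's training signal is presumably some variant of the standard conditional flow-matching objective the paper's bibliography points to (Lipman et al.). A minimal sketch, with notation chosen here rather than taken from the paper (a_1 a demonstrated action chunk, a_0 Gaussian noise, c the VLM reasoning context, v_theta the learned velocity field):

  a_t = (1 - t)\,a_0 + t\,a_1, \qquad t \sim \mathcal{U}[0, 1]
  \mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; a_0 \sim \mathcal{N}(0, I),\; a_1}\, \bigl\lVert v_\theta(a_t, t \mid c) - (a_1 - a_0) \bigr\rVert^2

At inference the expert integrates \dot{a}_t = v_\theta(a_t, t \mid c) from noise at t = 0 to an action chunk at t = 1, conditioned on the VLM's reasoning outputs.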

What carries the argument

Grafting a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning, enabling efficient action generation from reasoning outputs.
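To make "per-layer KV-cache conditioning" concrete, below is a minimal PyTorch-style sketch, assuming the expert's action tokens attend at each layer to the keys and values that the VLM layer at the same depth has already cached over its reasoning tokens. Every class and argument name is hypothetical; self-attention over the action tokens and timestep conditioning are omitted, and the paper's actual module may differ in projections, normalization, and cache layout.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class ActionExpertLayer(nn.Module):
      # One hypothetical expert layer: noisy action-chunk tokens query the keys/values
      # already cached by the matching VLM layer (all tensors assumed shape [B, T, D]).
      def __init__(self, d_model: int, n_heads: int):
          super().__init__()
          self.n_heads, self.head_dim = n_heads, d_model // n_heads
          self.q_proj = nn.Linear(d_model, d_model)    # only queries are produced here;
          self.out_proj = nn.Linear(d_model, d_model)  # K/V come straight from the cache
          self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                   nn.Linear(4 * d_model, d_model))
          self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

      def _heads(self, x: torch.Tensor) -> torch.Tensor:
          b, t, _ = x.shape
          return x.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

      def forward(self, action_tokens, cached_k, cached_v):
          # Per-layer conditioning: attend over this layer's cached VLM reasoning context.
          q = self._heads(self.q_proj(self.norm1(action_tokens)))
          ctx = F.scaled_dot_product_attention(q, self._heads(cached_k), self._heads(cached_v))
          ctx = ctx.transpose(1, 2).reshape(action_tokens.shape)
          x = action_tokens + self.out_proj(ctx)
          return x + self.mlp(self.norm2(x))

Under this reading, cached_k and cached_v are computed once per reasoning pass and reused across every denoising step of the flow-matching expert, which is what would make continuous-action generation cheap relative to re-running the VLM.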

Load-bearing premise

The improvements from the new training recipe, datasets, and architecture will translate to reliable high success rates in diverse, previously unseen real-world robot deployments.

What would settle it

Deploying MolmoAct2 on a new robot platform or task outside the 7 benchmarks and observing whether its success rate falls significantly below that of the Pi-05 baseline or becomes too low for practical use.

read the original abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
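As stated, MolmoThink's latency saving hinges on detecting which scene regions changed between timesteps. Below is a minimal sketch of one way such gating could work, assuming a per-patch pixel-difference test over consecutive frames; the patch size, threshold, and every function name are illustrative assumptions, not the paper's method.

  import torch

  def changed_patch_mask(prev_frame: torch.Tensor, curr_frame: torch.Tensor,
                         patch: int = 14, thresh: float = 0.05) -> torch.Tensor:
      # Frames are [C, H, W] tensors in [0, 1]. A patch counts as "changed" when its
      # mean absolute pixel difference between timesteps exceeds `thresh`.
      diff = (curr_frame - prev_frame).abs().mean(dim=0)               # [H, W]
      h, w = diff.shape
      diff = diff[: h - h % patch, : w - w % patch]                    # crop to a patch grid
      tiles = diff.unfold(0, patch, patch).unfold(1, patch, patch)     # [H/p, W/p, p, p]
      return tiles.mean(dim=(-1, -2)) > thresh                         # [H/p, W/p] bool

  def update_depth_tokens(depth_tokens, prev_frame, curr_frame, predict_fn):
      # Re-predict depth tokens only where the mask is True; reuse the rest verbatim.
      # `depth_tokens` is [num_patches, d]; `predict_fn` stands in for the model's
      # depth-token head restricted to the selected patches (hypothetical interface).
      mask = changed_patch_mask(prev_frame, curr_frame).flatten()
      if mask.any():
          depth_tokens = depth_tokens.clone()
          depth_tokens[mask] = predict_fn(curr_frame, mask)
      return depth_tokens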

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents MolmoAct2, a fully open Vision-Language-Action (VLA) model for practical robotic deployment. It advances the predecessor by introducing MolmoER, a VLM backbone specialized for spatial and embodied reasoning trained on a 3.3M-sample corpus using a specialize-then-rehearse recipe; three new datasets including the largest open bimanual dataset MolmoAct2-BimanualYAM with 720 hours of teleoperated trajectories; OpenFAST, an open-weight action tokenizer; an architectural redesign grafting a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning; and MolmoThink, an adaptive-depth reasoning variant that reduces latency by re-predicting depth tokens only for changing scene regions. The key claim is that in an extensive study spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks, with full release of weights, code, and data.

Significance. If the performance claims hold under independent verification enabled by the artifact releases, this would represent a significant contribution to the field by providing an open, deployable VLA model that addresses key barriers in current systems such as closed access, hardware costs, latency, and low success rates. The new datasets, particularly the bimanual one, and the architectural and training innovations could serve as foundations for future work in embodied reasoning and real-world robotics applications.

major comments (1)
  1. The manuscript claims outperformance across 7 benchmarks and 13 reasoning tasks, but the abstract (and likely the results section) lacks specific details on evaluation protocols, error bars, data exclusion criteria, or potential post-hoc benchmark selection. This information is critical to substantiate the central claim of being the most extensive empirical study of any open VLA to date and to allow full assessment of the reported superiority over Pi-05, GPT-5, and Gemini Robotics ER-1.5.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for minor revision. The feedback on improving transparency around our empirical evaluation is valuable, and we address it directly below while committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: The manuscript claims outperformance across 7 benchmarks and 13 reasoning tasks, but the abstract (and likely the results section) lacks specific details on evaluation protocols, error bars, data exclusion criteria, or potential post-hoc benchmark selection. This information is critical to substantiate the central claim of being the most extensive empirical study of any open VLA to date and to allow full assessment of the reported superiority over Pi-05, GPT-5, and Gemini Robotics ER-1.5.

    Authors: We appreciate the referee's emphasis on methodological transparency. The results section (Section 5) already provides detailed evaluation protocols for all 7 benchmarks and 13 reasoning tasks, including task definitions, number of evaluation episodes, success criteria, hardware configurations, and statistical reporting. Error bars (mean ± standard deviation across 3–5 random seeds or runs) are present in every table and figure. Data exclusion criteria for the newly introduced datasets are specified in Section 4, covering quality filtering, trajectory length thresholds, and embodiment-specific cleaning steps. Benchmark selection was performed a priori based on standard tasks from prior VLA literature (e.g., those used by Pi-05 and related works) to ensure comparability; no post-hoc selection occurred. That said, the abstract is intentionally concise and does not enumerate these details. In revision we will (1) expand the abstract with a brief clause on evaluation scope and statistical reporting, and (2) add a short dedicated paragraph or subsection early in the experiments section that consolidates protocols, error-bar methodology, exclusion rules, and selection rationale for easier reference. These changes will be limited to presentation and will not affect any numbers or conclusions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on released benchmarks and external baselines

full rationale

The paper is an empirical contribution describing a new VLA model, new datasets (MolmoAct2-BimanualYAM, Franka subsets), an action tokenizer (OpenFAST), architectural changes (KV-cache grafting of flow-matching expert), and a reasoning variant (MolmoThink). Performance claims are supported by direct comparisons to external baselines (Pi-05, GPT-5, Gemini Robotics ER-1.5) across 7 simulation/real-world and 13 embodied-reasoning benchmarks. No mathematical derivations, equations, or 'predictions' are presented that reduce by construction to fitted inputs or self-citations. The work explicitly releases weights, code, and data for independent verification, satisfying the criteria for self-contained empirical results with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Central claims rest on empirical training success with chosen data scales, training recipes, and architecture grafts; the numerous free parameters in model design and data curation are typical of large-scale ML but are not independently derived.

free parameters (2)
  • training corpus size
    3.3M-sample corpus size for MolmoER specialization chosen to achieve embodied reasoning gains.
  • bimanual dataset scale
    720-hour MolmoAct2-BimanualYAM dataset size selected as largest open bimanual collection.
axioms (1)
  • domain assumption
    New datasets and training distributions are representative of real-world robot deployment conditions. Invoked to support generalization claims from benchmarks to practical use.

pith-pipeline@v0.9.0 · 5753 in / 1534 out tokens · 86632 ms · 2026-05-11T00:56:02.526805+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 50 canonical work pages · 24 internal anchors

    in a kitchen environment, and design five tasks, each with three spatial variants, for evaluation. Camera positions are held fixed across all three models. Figure 9 shows sample trajectories of MolmoAct2 from 5 different tasks. Table 18 presents the evaluation results for MolmoAct2,π0, and MolmoBot in the DROID setup. C.3 Real-world Bimanual YAM Implement...