Recognition: 2 Lean theorem links
MolmoAct2: Action Reasoning Models for Real-world Deployment
Pith reviewed 2026-05-11 00:56 UTC · model grok-4.3
The pith
MolmoAct2 is a fully open action reasoning model that outperforms prior open and closed systems on robot control and embodied reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MolmoAct2 advances its predecessor along five axes: MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe; three new datasets, including the largest open collection of bimanual trajectories; OpenFAST, an open-weight action tokenizer; an architecture redesign that grafts a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning; and MolmoThink, an adaptive-depth reasoning variant that re-predicts only scene regions that change. On this basis, MolmoAct2 outperforms baselines such as Pi-05 on 7 benchmarks, MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 on 13 embodied-reasoning benchmarks, and the model weights, training code, and training data are fully released.
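The OpenFAST tokenizer is only named here, not specified. As a rough, hedged illustration of the FAST family it builds on (Pertsch et al., 2025), the sketch below compresses a normalized action chunk with a DCT, quantizes the coefficients, and leaves byte-pair encoding as a stub; the chunk shape, the `scale` constant, and the `bpe_encode` helper are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of a FAST-style action tokenizer (assumed details, not the
# released OpenFAST implementation): DCT-compress an action chunk, quantize
# the coefficients, then hand the integer stream to a BPE vocabulary.
import numpy as np
from scipy.fft import dct

def tokenize_chunk(actions: np.ndarray, scale: float = 10.0) -> list[int]:
    """actions: (horizon, action_dim) array of normalized actions."""
    # 1. DCT along the time axis concentrates energy in low frequencies.
    coeffs = dct(actions, axis=0, norm="ortho")
    # 2. Scale-and-round quantization drives most high-frequency terms to zero.
    quantized = np.round(coeffs * scale).astype(np.int64)
    # 3. Flatten into a 1-D integer stream.
    stream = quantized.flatten().tolist()
    # 4. A learned BPE step would merge frequent coefficient patterns into single
    #    tokens; stubbed here because the vocabulary is data-dependent.
    return bpe_encode(stream)

def bpe_encode(stream: list[int]) -> list[int]:
    # Hypothetical placeholder for a trained BPE vocabulary: identity mapping
    # shifted into a non-negative token range.
    offset = 1024  # assumed offset, not from the paper
    return [int(s) + offset for s in stream]
```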
What carries the argument
Grafting a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning, enabling efficient action generation from reasoning outputs.
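The paper describes this grafting only at the level of the sentence above. A minimal sketch of what per-layer KV-cache conditioning of a flow-matching action expert could look like, assuming a PyTorch-style decoder in which the expert cross-attends at every layer to the keys and values the VLM cached for that layer, is given below; all module names, dimensions, and the standard conditional flow-matching loss are illustrative assumptions rather than the MolmoAct2 implementation.

```python
# Hedged sketch: a flow-matching action expert conditioned on a VLM's per-layer
# KV cache. Module names, sizes, and the loss form are illustrative; this is
# not the MolmoAct2 implementation.
import torch
import torch.nn as nn

class ExpertLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Cross-attention from action tokens to the VLM's cached keys/values
        # (assumes the VLM hidden size equals d_model).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, vlm_k, vlm_v):
        attn_out, _ = self.cross_attn(self.norm1(x), vlm_k, vlm_v)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

class FlowMatchingActionExpert(nn.Module):
    def __init__(self, d_model: int, action_dim: int, horizon: int, n_layers: int):
        super().__init__()
        self.horizon = horizon
        self.in_proj = nn.Linear(action_dim + 1, d_model)  # +1 for the flow time t
        self.layers = nn.ModuleList([ExpertLayer(d_model) for _ in range(n_layers)])
        self.out_proj = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, t, kv_cache):
        # noisy_actions: (B, horizon, action_dim); t: (B, 1); kv_cache: one
        # (keys, values) pair per layer from the discrete-token VLM's forward
        # pass, so the expert reuses the reasoning context without re-running the VLM.
        t_feat = t.unsqueeze(1).expand(-1, self.horizon, -1)
        x = self.in_proj(torch.cat([noisy_actions, t_feat], dim=-1))
        for layer, (k, v) in zip(self.layers, kv_cache):
            x = layer(x, k, v)
        return self.out_proj(x)  # predicted velocity field

def flow_matching_loss(expert, actions, kv_cache):
    # Standard conditional flow-matching target: interpolate noise -> data along
    # a straight path and regress the constant velocity (data - noise).
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    x_t = (1 - t.unsqueeze(-1)) * noise + t.unsqueeze(-1) * actions
    pred_v = expert(x_t, t, kv_cache)
    return ((pred_v - (actions - noise)) ** 2).mean()
```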
Load-bearing premise
The improvements from the new training recipe, datasets, and architecture will translate to reliable high success rates in diverse, previously unseen real-world robot deployments.
What would settle it
Deploying MolmoAct2 on a new robot platform or task outside the 7 benchmarks and observing whether its success rate falls significantly below that of the Pi-05 baseline or becomes too low for practical use.
read the original abstract
Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2
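The abstract describes MolmoThink only as re-predicting depth tokens for regions that change between timesteps. One way such a scheme could work, purely as an illustration, is to keep the previous step's per-patch depth tokens, flag patches whose pixels changed beyond a threshold, and re-run prediction only on that subset; the patch size, threshold, and `predict_fn` model call below are assumptions, not the paper's method.

```python
# Illustrative sketch of change-masked depth-token reuse (not the MolmoThink
# implementation): only patches whose pixels changed since the previous timestep
# are re-predicted; the rest keep their cached tokens.
import numpy as np

PATCH = 16        # assumed patch size in pixels
THRESHOLD = 8.0   # assumed mean-absolute-difference threshold per patch

def changed_patches(prev_img: np.ndarray, cur_img: np.ndarray) -> np.ndarray:
    """Return a boolean (H // PATCH, W // PATCH) mask of patches that changed."""
    diff = np.abs(cur_img.astype(np.float32) - prev_img.astype(np.float32)).mean(-1)
    h, w = diff.shape
    diff = diff[: h - h % PATCH, : w - w % PATCH]  # crop to whole patches
    per_patch = diff.reshape(h // PATCH, PATCH, w // PATCH, PATCH).mean(axis=(1, 3))
    return per_patch > THRESHOLD

def update_depth_tokens(prev_tokens, prev_img, cur_img, predict_fn):
    """prev_tokens: (H // PATCH, W // PATCH) cached depth tokens.
    predict_fn(image, mask) -> tokens for the masked patches (hypothetical model call)."""
    mask = changed_patches(prev_img, cur_img)
    tokens = prev_tokens.copy()
    if mask.any():
        tokens[mask] = predict_fn(cur_img, mask)  # re-predict only the moving regions
    return tokens, mask
```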
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MolmoAct2, a fully open Vision-Language-Action (VLA) model for practical robotic deployment. It advances the predecessor by introducing MolmoER, a VLM backbone specialized for spatial and embodied reasoning trained on a 3.3M-sample corpus using a specialize-then-rehearse recipe; three new datasets including the largest open bimanual dataset MolmoAct2-BimanualYAM with 720 hours of teleoperated trajectories; OpenFAST, an open-weight action tokenizer; an architectural redesign grafting a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning; and MolmoThink, an adaptive-depth reasoning variant that reduces latency by re-predicting depth tokens only for changing scene regions. The key claim is that in an extensive study spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks, with full release of weights, code, and data.
Significance. If the performance claims hold under independent verification enabled by the artifact releases, this would represent a significant contribution to the field by providing an open, deployable VLA model that addresses key barriers in current systems such as closed access, hardware costs, latency, and low success rates. The new datasets, particularly the bimanual one, and the architectural and training innovations could serve as foundations for future work in embodied reasoning and real-world robotics applications.
major comments (1)
- The manuscript claims outperformance across 7 benchmarks and 13 reasoning tasks, but the abstract (and likely the results section) lacks specific details on evaluation protocols, error bars, data exclusion criteria, or potential post-hoc benchmark selection. This information is critical to substantiate the central claim of being the most extensive empirical study of any open VLA to date and to allow full assessment of the reported superiority over Pi-05, GPT-5, and Gemini Robotics ER-1.5.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the recommendation for minor revision. The feedback on improving transparency around our empirical evaluation is valuable, and we address it directly below while committing to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: The manuscript claims outperformance across 7 benchmarks and 13 reasoning tasks, but the abstract (and likely the results section) lacks specific details on evaluation protocols, error bars, data exclusion criteria, or potential post-hoc benchmark selection. This information is critical to substantiate the central claim of being the most extensive empirical study of any open VLA to date and to allow full assessment of the reported superiority over Pi-05, GPT-5, and Gemini Robotics ER-1.5.
Authors: We appreciate the referee's emphasis on methodological transparency. The results section (Section 5) already provides detailed evaluation protocols for all 7 benchmarks and 13 reasoning tasks, including task definitions, number of evaluation episodes, success criteria, hardware configurations, and statistical reporting. Error bars (mean ± standard deviation across 3–5 random seeds or runs) are present in every table and figure. Data exclusion criteria for the newly introduced datasets are specified in Section 4, covering quality filtering, trajectory length thresholds, and embodiment-specific cleaning steps. Benchmark selection was performed a priori based on standard tasks from prior VLA literature (e.g., those used by Pi-05 and related works) to ensure comparability; no post-hoc selection occurred. That said, the abstract is intentionally concise and does not enumerate these details. In revision we will (1) expand the abstract with a brief clause on evaluation scope and statistical reporting, and (2) add a short dedicated paragraph or subsection early in the experiments section that consolidates protocols, error-bar methodology, exclusion rules, and selection rationale for easier reference. These changes will be limited to presentation and will not affect any numbers or conclusions. revision: yes
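For readers re-running the released benchmarks, the error-bar convention the authors describe (mean ± standard deviation over 3–5 seeds or runs) reduces to a one-liner; the per-seed success rates below are placeholders, not reported results.

```python
# Aggregating per-seed success rates into the mean ± std reported in the paper's
# tables; the values here are placeholders, not results from the paper.
import statistics

seed_success_rates = [0.78, 0.81, 0.75, 0.80]  # hypothetical, one entry per seed
mean = statistics.mean(seed_success_rates)
std = statistics.stdev(seed_success_rates)     # sample standard deviation
print(f"success rate: {mean:.2f} ± {std:.2f}")
```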
Circularity Check
No significant circularity; empirical claims rest on released benchmarks and external baselines
full rationale
The paper is an empirical contribution describing a new VLA model, new datasets (MolmoAct2-BimanualYAM, Franka subsets), an action tokenizer (OpenFAST), architectural changes (KV-cache grafting of flow-matching expert), and a reasoning variant (MolmoThink). Performance claims are supported by direct comparisons to external baselines (Pi-05, GPT-5, Gemini Robotics ER-1.5) across 7 simulation/real-world and 13 embodied-reasoning benchmarks. No mathematical derivations, equations, or 'predictions' are presented that reduce by construction to fitted inputs or self-citations. The work explicitly releases weights, code, and data for independent verification, satisfying the criteria for self-contained empirical results with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- training corpus size
- bimanual dataset scale
axioms (1)
- domain assumption: New datasets and training distributions are representative of real-world robot deployment conditions.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
We design a new VLA architecture to graft the discrete-token VLM into the flow-matching continuous-action expert via per-layer key-value (KV) conditioning.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
MolmoAct2-Think... performs adaptive depth reasoning by autoregressively predicting only the tokens for scene regions that change between timesteps.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
- [2] L. Berscheid, P. Meißner, and T. Kröger. Robot learning of shifting objects for grasping in cluttered environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019.
- [3] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
- [4] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.
- [5] Z. Cai, R. Wang, C. Gu, F. Pu, J. Xu, Y. Wang, W. Yin, Z. Yang, C. Wei, Q. Sun, T. Zhou, J. Li, H. E. Pang, O. Qian, Y. Wei, Z. Lin, X. Shi, K. Deng, X. Han, Z. Chen, X. Fan, H. Deng, L. Lu, L. Pan, B. Li, Z. Liu, Q. Wang, D. Lin, and L. Yang. Scaling spatial intelligence with multimodal foundation models. URL: https://arxiv.org/abs/2511.04668.
- [6]
- [7]
- [8] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- [9] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. RoboNet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215.
- [10]
- [11] A. Deshpande, M. Guru, R. Hendrix, S. Jauhri, A. Eftekhar, R. Tripathi, M. Argus, J. Salvador, H. Fang, M. Wallingford, et al. MolmoBot: Large-scale simulation enables zero-shot manipulation. arXiv preprint arXiv:2603.16861.
- [12] M. Du, B. Wu, Z. Li, X.-J. Huang, and Z. Wei. EmbSpatial-Bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
- [13] F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge Data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396.
- [14]
- [15]
- [16] C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang. ThinkAct: Vision-language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815.
- [17]
- [18] E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning.
- [19]
- [20] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
- [21] M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645.
- [22] M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, et al. Cosmos Policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026a. Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, et al. Molmospaces: A lar...
- [23] J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. MolmoAct: Action reasoning models that can reason in space. URL: https://arxiv.org/abs/2508.07917.
- [24] J. H. Lee, M. Kerzel, K. Ahrens, C. Weber, and S. Wermter. What is right for me is not yet right for you: A dataset for grounding relative directions via multi-task learning. In International Joint Conference on Artificial Intelligence, Jul 2022. URL: https://www.ijcai.org/proceedings/2022/0145.pdf.
- [25] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a. Q. Li, Y. Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y. Deng, S. Xu, Y. Zhang, et al. Cogact: A foundational vision-langu...
- [26] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
- [27] S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864.
- [28] NVIDIA, J. Bjorck, N. C. Fernando Castañeda, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xi...
- [29] In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.
- [30] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747.
- [31] D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830.
- [32] P. Sermanet, T. Ding, J. Zhao, F. Xia, D. Dwibedi, K. Gopalakrishnan, C. Chan, G. Dulac-Arnold, S. Maddineni, N. J. Joshi, P. Florence, W. Han, R. Baruch, Y. Lu, S. Mirchandani, P. Xu, P. Sanketi, K. Hausman, I. Shafran, B. Ichter, and Y. Cao. RoboVQA: Multimodal long-horizon reasoning for robotics.
- [33]
- [34] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [35]
- [36] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
- [37] G. A. Team. Gen-1: Scaling embodied foundation models to mastery. Generalist AI Blog, 2026a. https://generalistai.com/blog/apr-02-2026-GEN-1. G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, et al. Gemini Robotics 1.5: Pushing the frontier of generalist robots with advan...
- [38] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786.
- [39]
- [40] W. Wang, M. Ghobadi, K. Shakeri, Y. Zhang, and N. Hasani. Rail-only: A low-cost high-performance network for training LLMs with trillion parameters. 2024 IEEE Symposium on High-Performance Interconnects (HOTI), pages 1–10, 2024.
- [41] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025a. Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, ...
- [42] Y. R. Wang, C. Ung, G. Tannert, J. Duan, J. Li, A. Le, R. Oswal, M. Grotz, W. Pumacay, Y. Deng, et al. RoboEval: Where robotic manipulation meets structured and scalable evaluation. arXiv preprint arXiv:2507.00435, 2025b. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in larg...
- [43] J. Wen, Y. Zhu, J. Li, Z. Tang, C. Shen, and F. Feng. DexVLA: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855.
- [44] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...
- [45] R. Yang, Z. Zhu, Y. Li, J. Huang, S. Yan, S. Zhou, Z. Liu, X. Li, S. Li, W. Wang, Y. Lin, and H. Zhao. Visual spatial tuning, 2025c. URL: https://arxiv.org/abs/2511.05491. S. Yang, J. Yang, P. Huang, E. Brown, Z. Yang, Y. Yu, S. Tong, Z. Zheng, Y. Xu, M. Wang, D. Lu, R. Fergus, Y. LeCun, L. Fei-Fei, and S. Xie. Cambrian-S: Towards spatial supersensing in vi...
- [46] M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693.
- [47]
- [48]
- [49]
- [50] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274.
- [51] R. Zheng, Y. Liang, S. Huang, J. Gao, H. Daumé III, A. Kolobov, F. Huang, and J. Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.
- [52]
- [53] C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792.
- [54] (excerpt from the paper's appendix) The appendix includes the following sections: §A Model Details, §B Training Details, §C Evaluation Details, §D Datasets Details, §E Limitations and Potential Solutions. Model Details expands the model description in Sec. 4.1 and Sec. 4.2. MolmoAct2 is built in two architectural stages. First, MolmoAct2-Pretrain adapts the...
- [55] (excerpt from the paper's appendix) The columns correspond to the three training stages used throughout the paper: pre-training, post-training, and embodiment-specific fine-tuning. B.2 MolmoAct2-FAST Tokenizer: The MolmoAct2-FAST Tokenizer architecture is an open-data implementation based on the FAST framework released by Physical Intelligence (Pertsch et al., 2025). While we utilize the core...
- [56] (excerpt from the paper's appendix) ...in a kitchen environment, and design five tasks, each with three spatial variants, for evaluation. Camera positions are held fixed across all three models. Figure 9 shows sample trajectories of MolmoAct2 from 5 different tasks. Table 18 presents the evaluation results for MolmoAct2, π0, and MolmoBot in the DROID setup. C.3 Real-world Bimanual YAM Implement...
discussion (0)