arxiv: 2605.02881 · v2 · submitted 2026-05-04 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

MolmoAct2: Action Reasoning Models for Real-world Deployment

Haoquan Fang , Jiafei Duan , Donovan Clay , Sam Wang , Shuo Liu , Weikai Huang , Xiang Fan , Wei-Chuan Tsai

show 21 more authors

Shirui Chen Yi Ru Wang Shanli Xing Jaemin Cho Jae Sung Park Ainaz Eftekhar Peter Sushko Karen Farley Angad Wadhwa Cole Harrison Winson Han Ying-Chun Lee Eli VanderBilt Rose Hendrix Suveen Ellawela Lucas Ngoo Joyce Chai Zhongzheng Ren Ali Farhadi Dieter Fox Ranjay Krishna

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:56 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision language actionroboticsembodied AIopen sourceaction reasoningbimanualflow matching

0 comments

The pith

MolmoAct2 is a fully open action reasoning model that outperforms prior open and closed systems on robot control and embodied reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops MolmoAct2 to overcome barriers in current VLA models for real-world robot use, such as closed access, high costs, latency, and low success rates. It does this by introducing a specialized embodied reasoning backbone, extensive new datasets including a massive bimanual one, an open action tokenizer, a hybrid architecture for continuous actions, and an efficient adaptive reasoning method. If these advances hold, they would allow effective robot deployment using accessible open models on standard hardware. The extensive testing across multiple benchmarks supports claims of superior performance.

Core claim

MolmoAct2 advances its predecessor along five axes with MolmoER as the VLM backbone specialized for spatial and embodied reasoning on a 3.3M-sample corpus using a specialize-then-rehearse recipe, three new datasets with the largest open bimanual trajectories, the OpenFAST open-weight action tokenizer, an architecture redesign grafting flow-matching continuous-action expert onto discrete-token VLM via per-layer KV-cache conditioning, and MolmoThink adaptive-depth reasoning that re-predicts only changing regions. This leads to MolmoAct2 outperforming baselines like Pi-05 on 7 benchmarks and MolmoER surpassing GPT-5 and Gemini Robotics ER-1.5 on 13 embodied-reasoning benchmarks, with full open

What carries the argument

Grafting a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning, enabling efficient action generation from reasoning outputs.

Load-bearing premise

The improvements from the new training recipe, datasets, and architecture will translate to reliable high success rates in diverse, previously unseen real-world robot deployments.

What would settle it

Deploying MolmoAct2 on a new robot platform or task outside the 7 benchmarks and observing whether its success rate falls significantly below that of the Pi-05 baseline or becomes too low for practical use.

read the original abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MolmoAct2 ships a full open VLA release with new datasets, tokenizer, and architecture tweaks that beat reported baselines on multiple tasks.

read the letter

The main point is that MolmoAct2 comes with everything released: model weights, training code, and new datasets including the largest open bimanual collection at 720 hours. That openness plus the benchmark numbers is what stands out for anyone who wants to build on or check this work directly. They introduce MolmoER as a backbone trained on 3.3 million samples using a specialize-then-rehearse recipe aimed at spatial and embodied reasoning. The BimanualYAM dataset plus filtered Franka and SO100 subsets give concrete new training material across platforms. OpenFAST provides an open action tokenizer, the architecture grafts a flow-matching expert via per-layer KV-cache conditioning, and MolmoThink adds adaptive reasoning that skips unchanged regions to cut latency. These changes produce the claimed wins: MolmoAct2 ahead of Pi-05 on seven simulation and real benchmarks, and MolmoER ahead of GPT-5 and Gemini on thirteen embodied-reasoning ones. The releases make the empirical claims testable rather than just stated. The abstract stays light on exact protocols, error bars, or data filtering details, so the strength of the outperformance will need the code to verify. Real-world generalization beyond the tested setups is still the usual open question for VLAs, though the artifacts lower the barrier to checking that. This paper is for robotics and VLA researchers who need usable open starting points with data and code. It has enough new empirical scope and verifiability to deserve peer review.

Referee Report

1 major / 0 minor

Summary. The paper presents MolmoAct2, a fully open Vision-Language-Action (VLA) model for practical robotic deployment. It advances the predecessor by introducing MolmoER, a VLM backbone specialized for spatial and embodied reasoning trained on a 3.3M-sample corpus using a specialize-then-rehearse recipe; three new datasets including the largest open bimanual dataset MolmoAct2-BimanualYAM with 720 hours of teleoperated trajectories; OpenFAST, an open-weight action tokenizer; an architectural redesign grafting a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning; and MolmoThink, an adaptive-depth reasoning variant that reduces latency by re-predicting depth tokens only for changing scene regions. The key claim is that in an extensive study spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks, with full release of weights, code, and data.

Significance. If the performance claims hold under independent verification enabled by the artifact releases, this would represent a significant contribution to the field by providing an open, deployable VLA model that addresses key barriers in current systems such as closed access, hardware costs, latency, and low success rates. The new datasets, particularly the bimanual one, and the architectural and training innovations could serve as foundations for future work in embodied reasoning and real-world robotics applications.

major comments (1)

The manuscript claims outperformance across 7 benchmarks and 13 reasoning tasks, but the abstract (and likely the results section) lacks specific details on evaluation protocols, error bars, data exclusion criteria, or potential post-hoc benchmark selection. This information is critical to substantiate the central claim of being the most extensive empirical study of any open VLA to date and to allow full assessment of the reported superiority over Pi-05, GPT-5, and Gemini Robotics ER-1.5.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for minor revision. The feedback on improving transparency around our empirical evaluation is valuable, and we address it directly below while committing to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: The manuscript claims outperformance across 7 benchmarks and 13 reasoning tasks, but the abstract (and likely the results section) lacks specific details on evaluation protocols, error bars, data exclusion criteria, or potential post-hoc benchmark selection. This information is critical to substantiate the central claim of being the most extensive empirical study of any open VLA to date and to allow full assessment of the reported superiority over Pi-05, GPT-5, and Gemini Robotics ER-1.5.

Authors: We appreciate the referee's emphasis on methodological transparency. The results section (Section 5) already provides detailed evaluation protocols for all 7 benchmarks and 13 reasoning tasks, including task definitions, number of evaluation episodes, success criteria, hardware configurations, and statistical reporting. Error bars (mean ± standard deviation across 3–5 random seeds or runs) are present in every table and figure. Data exclusion criteria for the newly introduced datasets are specified in Section 4, covering quality filtering, trajectory length thresholds, and embodiment-specific cleaning steps. Benchmark selection was performed a priori based on standard tasks from prior VLA literature (e.g., those used by Pi-05 and related works) to ensure comparability; no post-hoc selection occurred. That said, the abstract is intentionally concise and does not enumerate these details. In revision we will (1) expand the abstract with a brief clause on evaluation scope and statistical reporting, and (2) add a short dedicated paragraph or subsection early in the experiments section that consolidates protocols, error-bar methodology, exclusion rules, and selection rationale for easier reference. These changes will be limited to presentation and will not affect any numbers or conclusions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on released benchmarks and external baselines

full rationale

The paper is an empirical contribution describing a new VLA model, new datasets (MolmoAct2-BimanualYAM, Franka subsets), an action tokenizer (OpenFAST), architectural changes (KV-cache grafting of flow-matching expert), and a reasoning variant (MolmoThink). Performance claims are supported by direct comparisons to external baselines (Pi-05, GPT-5, Gemini Robotics ER-1.5) across 7 simulation/real-world and 13 embodied-reasoning benchmarks. No mathematical derivations, equations, or 'predictions' are presented that reduce by construction to fitted inputs or self-citations. The work explicitly releases weights, code, and data for independent verification, satisfying the criteria for self-contained empirical results with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Central claims rest on empirical training success with chosen data scales, training recipes, and architecture grafts; numerous free parameters in model design and data curation typical of large-scale ML but not independently derived.

free parameters (2)

training corpus size
3.3M-sample corpus size for MolmoER specialization chosen to achieve embodied reasoning gains.
bimanual dataset scale
720-hour MolmoAct2-BimanualYAM dataset size selected as largest open bimanual collection.

axioms (1)

domain assumption New datasets and training distributions are representative of real-world robot deployment conditions.
Invoked to support generalization claims from benchmarks to practical use.

pith-pipeline@v0.9.0 · 5753 in / 1534 out tokens · 86632 ms · 2026-05-11T00:56:02.526805+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We design a new VLA architecture to graft the discrete-token VLM into the flow-matching continuous-action expert via per-layer key-value (KV) conditioning.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MolmoAct2-Think... performs adaptive depth reasoning by autoregressively predicting only the tokens for scene regions that change between timesteps.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 50 canonical work pages · 24 internal anchors

[1]

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Berscheid, P

L. Berscheid, P. Meißner, and T. Kröger. Robot learning of shifting objects for grasping in cluttered environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),

2019
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.pi_0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

URLhttps://arxiv.org/abs/2511.04668. Z. Cai, R. Wang, C. Gu, F. Pu, J. Xu, Y. Wang, W. Yin, Z. Yang, C. Wei, Q. Sun, T. Zhou, J. Li, H. E. Pang, O. Qian, Y. Wei, Z. Lin, X. Shi, K. Deng, X. Han, Z. Chen, X. Fan, H. Deng, L. Lu, L. Pan, B. Li, Z. Liu, Q. Wang, D. Lin, and L. Yang. Scaling spatial intelligence with multimodal foundation models,

work page arXiv
[6]

URL https://arxiv.org/abs/2511.13719. Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y. Shentu, and P. Wu. Sarm: Stage-aware reward modeling for long horizon robot manipulation.arXiv preprint arXiv:2509.25358,

work page arXiv
[7]

Project page. L. Cheng, J. Duan, Y. R. Wang, H. Fang, B. Li, Y. Huang, E. Wang, A. Eftekhar, J. Lee, W. Yuan, et al. Pointarena: Probing multimodal grounding through language-guided pointing.arXiv preprint arXiv:2505.09990,

work page arXiv
[8]

URL https://arxiv.org/abs/2601.10611. G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page arXiv
[9]

Robonet: Large-scale multi-robot learning,

S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. Robonet: Large-scale multi-robot learning.arXiv preprint arXiv:1910.11215,

work page arXiv 1910
[10]

S. Deng, M. Yan, Y. Zheng, J. Su, W. Zhang, X. Zhao, H. Cui, Z. Zhang, and H. Wang. Stereovla: Enhancing vision-language-action models with stereo vision.arXiv preprint arXiv:2512.21970,

work page arXiv
[11]

Deshpande, M

A. Deshpande, M. Guru, R. Hendrix, S. Jauhri, A. Eftekhar, R. Tripathi, M. Argus, J. Salvador, H. Fang, M. Wallingford, et al. Molmob0t: Large-scale simulation enables zero-shot manipulation.arXiv preprint arXiv:2603.16861,

work page arXiv
[12]

URL https://arxiv.org/abs/2505.23705. 30 M. Du, B. Wu, Z. Li, X.-J. Huang, and Z. Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),

work page arXiv
[13]

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

F. Ebert, Y. Yang, K. Schmeckpeper, B. Bucher, G. Georgakis, K. Daniilidis, C. Finn, and S. Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets.arXiv preprint arXiv:2109.13396,

work page internal anchor Pith review arXiv
[14]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot.arXiv preprint arXiv:2307.00595,

work page arXiv
[15]

URLhttps://arxiv.org/abs/2311.18259. M. Guerquin. Introducing AI2’s beaker.AI2 Blog,

work page arXiv
[16]

Accessed: 2024-12-31

URLhttps://web.archive.org/web/20241231204439/https: //medium.com/ai2-blog/beaker-ed617d5f4593. Accessed: 2024-12-31. Original: https://medium.com/ai2-blog/ beaker-ed617d5f4593. C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815,

work page arXiv 2024
[17]

C.-Y. Hung, Q. Sun, P. Hong, A. Zadeh, C. Li, U. Tan, N. Majumder, S. Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854,

work page arXiv
[18]

URLhttps://arxiv.org/abs/2504.16054. E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. InConference on Robot Learning,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

URLhttps://arxiv.org/abs/1612.06890. A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

work page arXiv
[20]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645,

work page internal anchor Pith review arXiv
[22]

M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026a. Y. Kim, W. Pumacay, O. Rayyan, M. Argus, W. Han, E. VanderBilt, J. Salvador, A. Deshpande, R. Hendrix, S. Jauhri, et al. Molmospaces: A lar...

work page internal anchor Pith review arXiv
[23]

URL https://arxiv.org/abs/2411.15124. J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space,

work page internal anchor Pith review arXiv
[24]

URLhttps://arxiv.org/abs/2508.07917. J. H. Lee, M. Kerzel, K. Ahrens, C. Weber, and S. Wermter. What is right for me is not yet right for you: A dataset for grounding relative directions via multi-task learning. InInternational Joint Conference on Artificial Intelligence, Jul

work page internal anchor Pith review arXiv
[25]

URLhttps://www.ijcai.org/proceedings/2022/0145.pdf. B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Q. Li, Y. Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y. Deng, S. Xu, Y. Zhang, et al. Cogact: A foundational vision-langu...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Flow Matching for Generative Modeling

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864,

work page internal anchor Pith review arXiv
[28]

NVIDIA, J

URLhttps://arxiv.org/abs/2301.02229. NVIDIA, J. Bjorck, N. C. Fernando Castañeda, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xi...

work page arXiv
[29]

In 2024 IEEE International Conference on Robotics and Automation (ICRA),

2024
[30]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review arXiv
[31]

D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830,

work page internal anchor Pith review arXiv
[32]

URL https://arxiv.org/abs/2412.07755. P. Sermanet, T. Ding, J. Zhao, F. Xia, D. Dwibedi, K. Gopalakrishnan, C. Chan, G. Dulac-Arnold, S. Maddineni, N. J. Joshi, P. Florence, W. Han, R. Baruch, Y. Lu, S. Mirchandani, P. Xu, P. Sanketi, K. Hausman, I. Shafran, B. Ichter, and Y. Cao. Robovqa: Multimodal long-horizon reasoning for robotics,

work page arXiv
[33]

URLhttps://arxiv.org/ abs/2311.00899. N. M. M. Shafiullah, A. Rai, H. Etukuru, Y. Liu, I. Misra, S. Chintala, and L. Pinto. On bringing robots home.arXiv preprint arXiv:2311.16098,

work page arXiv
[34]

URLhttps://arxiv.org/abs/2506.01844. A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267,

work page internal anchor Pith review arXiv
[35]

Q. Sun, P. Hong, T. D. Pala, V. Toh, U. Tan, D. Ghosal, S. Poria, et al. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning.arXiv preprint arXiv:2412.11974,

work page arXiv
[36]

G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

G. A. Team. Gen-1: Scaling embodied foundation models to mastery.Generalist AI Blog, 2026a. https://generalistai.com/blog/apr-02-2026-GEN-1. G. R. Team, A. Abdolmaleki, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, A. Balakrishna, N. Batchelor, A. Bewley, J. Bingham, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advan...

work page arXiv 2026
[38]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786,

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Y. Tur, J. Naghiyev, H. Fang, W.-C. Tsai, J. Duan, D. Fox, and R. Krishna. Recurrent-depth vla: Implicit test-time compute scaling of vision-language-action models via latent iterative reasoning.arXiv preprint arXiv:2602.07845,

work page arXiv
[40]

W. Wang, M. Ghobadi, K. Shakeri, Y. Zhang, and N. Hasani. Rail-only: A low-cost high-performance network for training llms with trillion parameters.2024 IEEE Symposium on High-Performance Interconnects (HOTI), pages 1–10,

2024
[41]

URLhttps://api.semanticscholar.org/CorpusID:260125277. 33 W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025a. Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, ...

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Y. R. Wang, C. Ung, G. Tannert, J. Duan, J. Li, A. Le, R. Oswal, M. Grotz, W. Pumacay, Y. Deng, et al. Roboeval: Where robotic manipulation meets structured and scalable evaluation.arXiv preprint arXiv:2507.00435, 2025b. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in larg...

work page internal anchor Pith review Pith/arXiv arXiv
[43]

J. Wen, Y. Zhu, J. Li, Z. Tang, C. Shen, and F. Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855,

work page Pith review arXiv
[44]

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

work page internal anchor Pith review Pith/arXiv arXiv
[45]

R. Yang, Z. Zhu, Y. Li, J. Huang, S. Yan, S. Zhou, Z. Liu, X. Li, S. Li, W. Wang, Y. Lin, and H. Zhao. Visual spatial tuning, 2025c. URLhttps://arxiv.org/abs/2511.05491. S. Yang, J. Yang, P. Huang, E. Brown, Z. Yang, Y. Yu, S. Tong, Z. Zheng, Y. Xu, M. Wang, D. Lu, R. Fergus, Y. LeCun, L. Fei-Fei, and S. Xie. Cambrian-s: Towards spatial supersensing in vi...

work page arXiv
[46]

Robotic Control via Embodied Chain-of-Thought Reasoning

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning.arXiv preprint arXiv:2407.08693,

work page internal anchor Pith review arXiv
[47]

L. Zha, A. J. Hancock, M. Zhang, T. Yin, Y. Huang, D. Shah, A. Z. Ren, and A. Majumdar. Lap: Language-action pre-training enables zero-shot cross-embodiment transfer.arXiv preprint arXiv:2602.10556,

work page arXiv
[48]

Zhang, M

J. Zhang, M. Memmel, K. Kim, D. Fox, J. Thomason, F. Ramos, E. Bıyık, A. Gupta, and A. Li. Peek: Guiding and minimal image representations for zero-shot generalization of robot manipulation policies.arXiv preprint arXiv:2509.18282,

work page arXiv
[49]

T. Z. Zhao, J. Tompson, D. Driess, P. Florence, K. Ghasemipour, C. Finn, and A. Wahid. Aloha unleashed: A simple recipe for robot dexterity.arXiv preprint arXiv:2410.13126,

work page arXiv
[50]

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274,

work page internal anchor Pith review arXiv
[51]

arXiv preprint arXiv:2412.10345 (2024) 13

R. Zheng, Y. Liang, S. Huang, J. Gao, H. Daumé III, A. Kolobov, F. Huang, and J. Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345,

work page arXiv
[52]

E. Zhou, J. An, C. Chi, Y. Han, S. Rong, C. Zhang, P. Wang, Z. Wang, T. Huang, L. Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308,

work page arXiv
[53]

C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792,

work page internal anchor Pith review arXiv
[54]

4.1 and Sec

34 Appendix The appendix includes the following sections: •§A - Model Details •§B - Training Details •§C - Evaluation Details •§D - Datasets Details •§E - Limitations and Potential Solutions A Model Details This appendix expands the model description in Sec. 4.1 and Sec. 4.2.MolmoAct2is built in two architectural stages. First,MolmoAct2-Pretrainadapts the...

2026
[55]

dialects,

The columns correspond to the three training stages used throughout the paper: pre-training, post-training, and embodiment-specific fine-tuning. B.2 MolmoAct2-FAST Tokenizer TheMolmoAct2-F AST Tokenizerarchitecture is an open-data implementation based on the FAST framework released by Physical Intelligence (Pertsch et al., 2025). While we utilize the core...

2025
[56]

Camera positions are held fixed across all three models

in a kitchen environment, and design five tasks, each with three spatial variants, for evaluation. Camera positions are held fixed across all three models. Figure 9 shows sample trajectories of MolmoAct2 from 5 different tasks. Table 18 presents the evaluation results for MolmoAct2,π0, and MolmoBot in the DROID setup. C.3 Real-world Bimanual YAM Implement...

2016