Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Che Liu; Haosong Sun; Haoyuan Shi; Jian Tang; Jiayu Hu; Jinpeng Lu; Jin Xu; Junwei Liao; Kuishu Wu; Nga Teng Chan

arxiv: 2605.15153 · v2 · pith:WV2OW65Anew · submitted 2026-05-14 · 💻 cs.RO · cs.AI

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju

Pith reviewed 2026-05-22 09:40 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords unified embodied model · vision-language model · future video generation · robot action · multimodal reasoning · joint training · embodied intelligence

The pith

A single VLM with joint language-video-action losses and a future generator can match specialist performance on understanding, reasoning, and embodied action.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pelican-Unify 1.0 trains one vision-language model (VLM) to serve as both an understanding module, which maps scenes and instructions into a shared semantic space, and a reasoning module, which produces task-oriented chains of thought in a single forward pass. The final hidden state is projected into a latent variable, on which a Unified Future Generator (UFG) conditions to produce future videos and actions together through separate heads within the same denoising process. Language, video, and action losses are all back-propagated into the shared representation, so understanding, reasoning, imagination, and action are optimized jointly rather than in isolation. With a single checkpoint, the model reports 64.7 on eight VLM benchmarks, first place on WorldArena at 66.03, and a 93.5 average on RoboTwin (second-best among compared action methods), which the authors present as evidence that unification need not force compromises.

Core claim

The paper claims that a single VLM can act as a unified understanding and reasoning module: it maps inputs into a shared semantic space, autoregressively generates chains of thought, and projects its final hidden state into a latent variable. A Unified Future Generator then conditions on this latent to jointly generate future videos and future actions via modality-specific heads in one denoising process. Language, video, and action losses are all back-propagated into the shared representation. Using one checkpoint, the model reaches 64.7 on VLM benchmarks, 66.03 on WorldArena (first), and 93.5 on RoboTwin (second-best).

What carries the argument

The Unified Future Generator (UFG) carries the argument: it takes the latent projected from the VLM's final hidden state and produces future videos and actions through two modality-specific heads inside the same denoising process.
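
The data flow described here can be sketched in a few lines of Python. This is a toy illustration only, not the paper's implementation: `shared_trunk`, `video_head`, `action_head`, the update rule, and all sizes are hypothetical stand-ins chosen so the example runs with the standard library. It is meant to show the structural point only, namely that one latent conditions a single denoising loop and two modality-specific heads read the same shared feature at every step.

```python
import random


def shared_trunk(latent, video, action, t):
    """Hypothetical shared denoiser body: mixes the latent with both noisy streams."""
    context = sum(latent) / len(latent)
    mix = (sum(video) / len(video) + sum(action) / len(action)) / 2.0
    return context + 0.1 * mix * t


def video_head(h, video):
    # Video-specific head (toy rule): move each element halfway toward h.
    return [v + 0.5 * (h - v) for v in video]


def action_head(h, action):
    # Action-specific head (toy rule): move each element halfway toward 0.5 * h.
    return [a + 0.5 * (0.5 * h - a) for a in action]


def unified_future_generator(latent, video_len=4, action_len=2, steps=8, seed=0):
    """Run one denoising loop that updates video and action streams together."""
    rng = random.Random(seed)
    # Both modalities start from noise.
    video = [rng.gauss(0.0, 1.0) for _ in range(video_len)]
    action = [rng.gauss(0.0, 1.0) for _ in range(action_len)]
    for step in range(steps, 0, -1):
        t = step / steps
        # One shared feature per step, consumed by both heads.
        h = shared_trunk(latent, video, action, t)
        video = video_head(h, video)
        action = action_head(h, action)
    return video, action


future_video, future_action = unified_future_generator(latent=[1.0, 2.0])
```

Because both heads read the same `h`, any training signal applied to either output would flow back through the shared trunk and the latent, which is the coupling the paper relies on.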

If this is right

Embodied systems could replace separate perception, reasoning, and control modules with one jointly trained network.
Planning in robots could directly use the model's generated future videos and actions for decision making.
Training pipelines for embodied AI can focus on a single shared representation instead of staged specialist training.
Performance on language and vision benchmarks can transfer to action generation without separate fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce overall system complexity and latency by eliminating hand-offs between separate models.
Extending the joint denoising process to longer video horizons or additional sensor modalities could be tested directly on the same architecture.
Real-world robot deployment would show whether the shared representation improves robustness to distribution shift compared with specialist stacks.

Load-bearing premise

Joint back-propagation of language, video, and action losses through one shared VLM representation is enough to couple understanding, reasoning, imagination, and action without creating performance trade-offs.
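
A toy example may help make this premise concrete. In the sketch below, a single shared scalar `w` stands in for the shared representation, and each of three quadratic losses (hypothetical stand-ins for the language, video, and action objectives) pulls it toward its own target. The targets, weights, and learning rate are illustrative assumptions, not values from the paper.

```python
def joint_gradient(w, targets, weights):
    """Gradient of sum_k weights[k] * (w - targets[k])**2 with respect to the shared w."""
    return sum(2.0 * weights[k] * (w - targets[k]) for k in targets)


def train_shared(targets, weights, lr=0.1, steps=200):
    """Plain gradient descent on the joint loss, starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w -= lr * joint_gradient(w, targets, weights)
    return w


equal_weights = {"language": 1.0, "video": 1.0, "action": 1.0}

# Objectives that agree: the shared parameter can satisfy all three at once.
aligned = train_shared({"language": 3.0, "video": 3.0, "action": 3.0}, equal_weights)

# Objectives that disagree: the shared parameter settles on a weighted compromise.
conflicting = train_shared({"language": 1.0, "video": 2.0, "action": 4.0}, equal_weights)
```

In this toy setting, `aligned` converges to the common target, while `conflicting` converges to the weighted mean of the three targets and so fits none of them exactly. The premise amounts to assuming that the real objectives behave more like the first case than the second.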

What would settle it

If independent specialist models trained on the same data volume and scale outperform the unified model on any of the reported VLM, WorldArena, or RoboTwin metrics, the claim that joint optimization avoids compromises would be refuted.
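
The test above can be written as a simple check. The sketch below is an assumption-laden illustration: the unified scores are the three figures reported in the abstract, while the specialist scores are placeholder values invented for the example and do not correspond to any real model.

```python
def refuting_benchmarks(unified, specialists, margin=0.0):
    """Return benchmarks where a matched specialist beats the unified model by more than margin."""
    return sorted(
        name
        for name, score in unified.items()
        if name in specialists and specialists[name] > score + margin
    )


# Scores reported in the abstract for the single Pelican-Unify 1.0 checkpoint.
unified_scores = {"VLM (8 benchmarks)": 64.7, "WorldArena": 66.03, "RoboTwin": 93.5}

# HYPOTHETICAL placeholder scores for specialists trained on the same data and scale.
placeholder_specialists = {"VLM (8 benchmarks)": 60.0, "WorldArena": 70.0, "RoboTwin": 93.5}

flagged = refuting_benchmarks(unified_scores, placeholder_specialists)
```

A non-empty result on real, scale-matched specialists would count against the no-compromise claim; the `margin` argument leaves room for run-to-run noise.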

Figures

Figures reproduced from arXiv: 2605.15153 by Che Liu, Haosong Sun, Haoyuan Shi, Jian Tang, Jiayu Hu, Jinpeng Lu, Jin Xu, Junwei Liao, Kuishu Wu, Nga Teng Chan, Renwen Cui, Senkang Hu, Shilong Zou, Wenhai Liu, Xiancong Ren, Xianzhou Hou, Xiaopeng Zhang, Xiaozhu Ju, Yang Xu, Yechen Wu, Yechi Liu, Yidong Wang, Yinda Chen, Yingji Zhang, Yi Zhang, Yong Dai, Zecong Tang, Zeting Liu, Zeyuan Ding.

**Figure 1.** Pelican-Unify 1.0 closes the understand-reason-imagine-act loop by centering all three faces on one loop state [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]

**Figure 2.** Starting from a base VLM, standard VLA policy training weakens grounding and attention, while Pelican-Unify 1.0 … [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

**Figure 3.** Pelican-Unify 1.0 can take actions as conditional inputs, enabling action-conditioned video prediction. Left: The … [PITH_FULL_IMAGE:figures/full_fig_p010_3.png]

**Figure 4.** Compositional generalization evaluation. During training, the model is optimized only on atomic manipulation tasks individually, without exposure to their composed counterparts. At test time, we evaluate the model on unseen compositional tasks that require combining multiple learned skills, demonstrating strong compositional generalization ability in long-horizon embodied manipulation. Failures are concen…

**Figure 5.** Fine-grained manipulation and physical imagination capability. Our model demonstrates strong fine-grained embodied manipulation skills in challenging connector insertion tasks, including waterproof, RJ45, and USB insertion, while also exhibiting powerful physical imagination ability to predict plausible future interactions and object dynamics under real-world constraints.

**Figure 6.** Execution timelines of seen and unseen robotic manipulation tasks. For each task, we visualize synchronized side-view and top-view observations at five representative execution steps. The upper block shows two seen tasks, including sweeping debris into a dustpan and pouring into a cup, while the lower block shows an unseen cup-wiping task for evaluating cross-task generalization.

Original abstract

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pelican-Unify reports competitive embodied benchmarks from a single VLM plus joint video-action generator, but missing scale details leave the no-compromise claim hard to verify.

The core point is that Pelican-Unify 1.0 trains one VLM backbone to handle understanding and reasoning, then adds a Unified Future Generator that denoises future video and action outputs together from the same latent state. All three losses back-propagate into the shared representation. The paper shows this single checkpoint hitting 64.7 on eight VLM benchmarks, first place on WorldArena, and second on RoboTwin action averages, which is presented as evidence that unification preserves specialist-level results.

Referee Report

2 major / 1 minor

Summary. The paper introduces Pelican-Unify 1.0 as the first embodied foundation model trained under a unification principle. A single VLM serves as both the understanding module (mapping scenes, instructions, and histories into a shared semantic space) and the reasoning module (autoregressively generating task-, action-, and future-oriented chains of thought while projecting a final hidden state to a latent variable). A Unified Future Generator (UFG) then conditions on this latent to jointly generate future videos and actions via modality-specific heads in one denoising process. Language, video, and action losses are back-propagated into the shared VLM representation. The abstract reports that a single checkpoint achieves 64.7 on eight VLM benchmarks (best among comparable-scale models), 66.03 on WorldArena (rank 1), and 93.5 on RoboTwin (second-best average), concluding that unification preserves specialist strength without compromise.

Significance. If the unification mechanism and benchmark results can be rigorously verified, the work would constitute a meaningful advance in embodied AI by demonstrating that joint optimization of understanding, reasoning, imagination, and action within one model need not incur the performance trade-offs typical of multi-task or modular systems. It offers a concrete architectural path (shared VLM + UFG with joint loss back-propagation) toward more integrated robotic intelligence and challenges the prevailing specialist-model paradigm.

major comments (2)

[Abstract] The central claim that 64.7 constitutes 'the best among comparable-scale models' on VLM benchmarks cannot be evaluated because the manuscript reports no parameter count, layer count, hidden dimension, or FLOPs for Pelican-Unify 1.0 or for any baseline. Without this information the qualifier 'comparable-scale' is unverifiable and the assertion that unification avoids performance compromise rests on an uncheckable premise.
[Abstract / Experiments] The reported results (66.03 on WorldArena, 93.5 on RoboTwin) are presented without any description of the experimental protocol, baseline selection criteria, statistical significance tests, or ablation studies on the joint language-video-action loss. This absence prevents assessment of whether the numbers genuinely support the 'no compromise' conclusion or could arise from capacity differences or post-hoc selection.

minor comments (1)

[Abstract] Abstract: The acronym 'UFG' is introduced without an inline expansion on first use, which reduces immediate readability for readers scanning the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that additional information on model scale and experimental rigor is required to fully substantiate the claims in the abstract. We address each major comment below and will incorporate the necessary revisions into the next version of the manuscript.

Referee: [Abstract] The central claim that 64.7 constitutes 'the best among comparable-scale models' on VLM benchmarks cannot be evaluated because the manuscript reports no parameter count, layer count, hidden dimension, or FLOPs for Pelican-Unify 1.0 or for any baseline. Without this information the qualifier 'comparable-scale' is unverifiable and the assertion that unification avoids performance compromise rests on an uncheckable premise.

Authors: We agree that explicit model-scale specifications are essential for verifying the 'comparable-scale' qualifier. In the revised manuscript we will add a new table (or subsection within Experiments) that reports the exact parameter count, layer count, hidden dimension, and estimated FLOPs for Pelican-Unify 1.0 together with the corresponding figures for all VLM baselines. This addition will make the performance comparison directly verifiable and will strengthen the claim that unification preserves specialist-level results.
Revision: yes
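
As a rough guide to what such a scale table would contain, the sketch below uses the common back-of-the-envelope estimate for a standard dense transformer: about 12 × layers × hidden² non-embedding parameters (assuming a feed-forward width of 4 × hidden and full multi-head attention) and about 2 FLOPs per parameter per token for a forward pass. These are general approximations, not figures from the paper; the layer count and hidden size in the example are hypothetical, and gated MLPs or grouped-query attention would change the constants.

```python
def approx_transformer_params(n_layers, d_model):
    """Estimate non-embedding parameters of a standard dense transformer."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)   # up and down projections, d_ff = 4 * d_model
    return n_layers * (attention + mlp)


def approx_forward_flops_per_token(n_params):
    """Estimate forward-pass FLOPs per token: one multiply and one add per parameter."""
    return 2 * n_params


# HYPOTHETICAL configuration, for illustration only.
example_params = approx_transformer_params(n_layers=32, d_model=4096)
example_flops = approx_forward_flops_per_token(example_params)
```

Reporting these two numbers for the unified model and for each baseline would be enough to make the 'comparable-scale' qualifier checkable.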

Referee: [Abstract / Experiments] The reported results (66.03 on WorldArena, 93.5 on RoboTwin) are presented without any description of the experimental protocol, baseline selection criteria, statistical significance tests, or ablation studies on the joint language-video-action loss. This absence prevents assessment of whether the numbers genuinely support the 'no compromise' conclusion or could arise from capacity differences or post-hoc selection.

Authors: We acknowledge that the current manuscript lacks sufficient detail on the evaluation setup. We will expand the Experiments section to include: (i) a complete description of the WorldArena and RoboTwin evaluation protocols, (ii) explicit criteria used to select the reported baselines, (iii) statistical significance results (standard deviations across multiple runs and p-values where applicable), and (iv) ablation studies that isolate the contribution of the joint language-video-action loss. These additions will allow readers to assess whether the observed performance truly reflects the benefits of unification rather than capacity or selection artifacts.
Revision: yes
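
The statistics promised in item (iii) could be computed along the following lines. This is a minimal sketch using only the Python standard library; the per-seed scores are hypothetical placeholders, not results from the paper, and Welch's t statistic is one reasonable choice rather than the authors' stated method.

```python
import math
import statistics


def summarize(runs):
    """Return the mean and sample standard deviation of per-seed scores."""
    return statistics.mean(runs), statistics.stdev(runs)


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = math.sqrt(sd_a ** 2 / len(a) + sd_b ** 2 / len(b))
    return (mean_a - mean_b) / standard_error


# HYPOTHETICAL per-seed scores for two methods, for illustration only.
method_a_runs = [1.0, 2.0, 3.0]
method_b_runs = [2.0, 3.0, 4.0]

mean_a, sd_a = summarize(method_a_runs)
t_stat = welch_t(method_a_runs, method_b_runs)
```

Turning the statistic into a p-value additionally requires the Welch-Satterthwaite degrees of freedom and a t-distribution, which are omitted here to keep the sketch dependency-free.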

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper describes an architecture (single VLM for understanding/reasoning plus UFG for joint video/action generation) and states that joint back-propagation of language/video/action losses couples the capabilities. The central claim that unification preserves specialist strength is supported by reported scores on independent external benchmarks (eight VLM benchmarks, WorldArena, RoboTwin). These are not shown to reduce to the model's own definitions or fitted parameters by construction; they are presented as measured outcomes. No equations, self-citations, or ansatzes in the provided text create a load-bearing circular reduction. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract is available, so the ledger is necessarily incomplete. The model rests on standard VLM pre-training assumptions and on the novel claim that joint video-action denoising, combined with shared back-propagation, yields uncompromised performance.

invented entities (1)

Unified Future Generator (UFG) · no independent evidence
purpose: Jointly generate future videos and actions from a latent variable produced by the VLM in a single denoising process
Introduced as a core new module in the abstract; no independent evidence or external validation is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry gives the Lean source file, the theorem name, the tag for the relation between the paper passage and the cited Recognition theorem, and the paper passage itself.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "unified understanding, which embeds scenes, instructions, action histories, and visual contexts into a shared semantic space"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[12] [12]

DeepMind

G. DeepMind. Veo 3.1: Our most capable generative video model.https://deepmind.google/technologies/veo/,

work page

[13] [13]

Accessed: 2026-05-14

work page 2026

[14] [14]

D. C. Dennett. The embodied mind: Cognitive science and human experience, 1993

work page 1993

[15] [15]

L. Fan, Z. Xu, C. Cao, W. Zhang, M. Yuan, and J. Chen. Aim: Intent-aware unified world action modeling with spatial value maps.arXiv preprint arXiv:2604.11135, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[16] [16]

A. Figure. Helix: A vision-language-action model for generalist humanoid control, 2024

work page 2024

[17] [17]

K. Friston. The free-energy principle: A unified brain theory?Nature Reviews Neuroscience, 11(2):127–138, 2010

work page 2010

[18] [18]

Gigaworld-0: World models as data engine to empower embodied ai, 2025

GigaAI. Gigaworld-0: World models as data engine to empower embodied ai, 2025

work page 2025

[19] [19]

Y. Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. In The Fourteenth International Conference on Learning Representations (ICLR), 2026

work page 2026

[20] [20]

G. Hesslow. Conscious thought as simulation of behaviour and perception.Trends in Cognitive Sciences, 6(6):242–247, 2002

work page 2002

[21] [21]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Jeannerod

M. Jeannerod. Neural simulation of action: A unifying mechanism for motor cognition.NeuroImage, 14(1):S103–S109, 2001

work page 2001

[23] [23]

Jiang, S

Y. Jiang, S. Chen, S. Huang, L. Chen, P. Zhou, Y. Liao, X. He, C. Liu, H. Li, M. Yao, and G. Ren. Enerverse-ac: Envisioning embodied environments with action condition, 2025

work page 2025

[24] [24]

Kamath, J

A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ram´e, M. Rivi`ere, L. Rouillard, et al. Gemma 3 technical report, 2025

work page 2025

[25] [25]

E. R. Kandel, J. D. Koester, S. H. Mack, and S. A. Siegelbaum.Principles of Neural Science. McGraw-Hill Education, New York, 6 edition, 2021. page 13 of 16

work page 2021

[26] [26]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[27] [27]

J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025

work page 2025

[28] [28]

L. Li, Q. Zhang, Y. Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, Y. Shen, and Y. Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

work page 2024

[30] [30]

H. Luo, W. Zhang, Y. Feng, S. Zheng, H. Xu, C. Xu, Z. Xi, Y. Fu, and Z. Lu. Being-h0.7: A latent world-action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[31] [31]

L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [32]

Masry, D

A. Masry, D. Long, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics

work page 2022

[33] [33]

Mathew, V

M. Mathew, V. Bagal, R. P. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. Infographicvqa, 2021

work page 2021

[34] [34]

S. Miao, N. Feng, J. Wu, Y. Lin, X. He, D. Li, and M. Long. Jepa-vla: Video predictive embedding is needed for vla models, 2026

work page 2026

[35] [35]

Seedance, D

T. Seedance, D. Chen, L. Chen, X. Chen, Y. Chen, Z. Chen, Z. Chen, F. Cheng, T. Cheng, Y. Cheng, et al. Seedance 2.0: Advancing video generation for world complexity, 2026

work page 2026

[36] [36]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Y. Shang, Z. Li, Y. Ma, W. Su, X. Jin, Z. Wang, L. Jin, X. Zhang, Y. Tang, H. Su, et al. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

work page arXiv 2026

[37] [37]

H. Shen, T. Wu, Q. Han, Y. Hsieh, J. Wang, Y. Zhang, Y. Cheng, Z. Hao, Y. Ni, X. Wang, Z. Wan, K. Zhang, W. Xu, J. Xiong, P. Luo, W. Chen, C. Tao, Z. Mao, and N. Wong. Phyx: Does your model have the ”wits” for physical reasoning?, 2025

work page 2025

[38] [38]

G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

H. Team. Happyhorse-1.0, 2026

work page 2026

[40] [40]

M. Team, C. Xiang, F. Bao, H. Liu, H. Tan, H. Bi, J. Li, J. Liu, J. Pang, K. Jing, L. Liu, M. Cai, R. Cui, R. Zhao, R. Wang, S. Huang, Y. Feng, Y. Rong, Z. Wang, and J. Zhu. Motubrain: An advanced world action model for robot control, 2026

work page 2026

[41] [41]

W. Team. Wan2.6: A state-of-the-art video generation model.WanAI:LeadingAIVideoGenerationModel, 2026. Accessed: 2026-05-14

work page 2026

[42] [42]

W. Team. Wan2.7, 2026

work page 2026

[43] [43]

Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025

Unitree. Unifolm-wma-0: A world-model-action (wma) framework under unifolm family, 2025

work page 2025

[44] [44]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

W. Wu, F. Lu, Y. Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y. Wang, S. Ma, et al. A pragmatic vla foundation model.arXiv preprint arXiv:2601.18692, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

Y. Yang, S. Zeng, T. Lin, X. Chang, D. Qi, J. Xiao, H. Liu, R. Chen, Y. Chen, D. Huo, et al. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[47] [47]

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2025

work page 2025

[48] [48]

S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026. page 14 of 16

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [49]

T. Yuan, Z. Dong, Y. Liu, and H. Zhao. Fast-wam: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[50] [50]

W. Yuan, J. Duan, V. Blukis, W. Pumacay, R. Krishna, A. Murali, A. Mousavian, and D. Fox. Robopoint: A vision- language model for spatial affordance prediction for robotics, 2024

work page 2024

[51] [51]

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of CVPR, 2024

work page 2024

[52] [52]

Zawalski, W

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine. Robotic control via embodied chain-of-thought reasoning, 2025

work page 2025

[53] [53]

Zhang, C

Y. Zhang, C. Liu, X. Ren, H. Ni, S. Zhang, Z. Ding, J. Hu, H. Shan, Z. Niu, Z. Liu, et al. Pelican-vl 1.0: A foundation brain model for embodied intelligence.arXiv preprint arXiv:2511.00108, 2025

work page arXiv 2025

[54] [54]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, T. Wang, Y.-Q. Zhang, J. Liu, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. InThe Fourteenth International Conference on Learning Representations (ICLR), 2026

work page 2026

[55] [55]

E. Zhou, J. An, C. Chi, Y. Han, S. Rong, C. Zhang, P. Wang, Z. Wang, T. Huang, L. Sheng, and S. Zhang. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics, 2026

work page 2026

[56] [56]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language- action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. page 15 of 16

work page 2023

[57] [57]

The final public release will replace the group-level placeholders below with individual names after internal approval

Contributions Our contributors are organized based on their roles and magnitude of contribution. The final public release will replace the group-level placeholders below with individual names after internal approval. 6.1. Core Contributors Unified VLM and Action capability: Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding Unified World-model capability: Jin Xu,...

work page