OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

Jun Gao; Zhuoyuan Wu

arxiv: 2606.04463 · v2 · pith:IG2TQODEnew · submitted 2026-06-03 · 💻 cs.RO

OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics

Zhuoyuan Wu , Jun Gao This is my paper

Pith reviewed 2026-06-28 06:32 UTC · model grok-4.3

classification 💻 cs.RO

keywords roboticsworld modelaction-conditioned videopolicy evaluationembodiment generalizationkinematic skeletonvideo generation

0 comments

The pith

OSCAR conditions a video world model on 2D kinematic skeletons to evaluate robot policies virtually with strong correlation to real-world results across embodiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OSCAR as an action-conditioned video world model trained on a large standardized dataset combining robotics and egocentric human videos. It uses 2D kinematic skeleton rendering as the conditioning input to achieve better action following and generalization than prior models. The central goal is to enable robot policy evaluation inside generated videos rather than on physical hardware. The authors report that virtual evaluations on RoboArena policies align closely with real-world measurements.

Core claim

OSCAR is a precise action-conditioned video world model finetuned from Cosmos-Predict2.5-2B on a single GH200 GPU using a cleaned joint dataset from diverse robot and human sources. By adopting 2D kinematic skeleton rendering as a unified conditioning representation, the model improves action following, appearance quality, and motion consistency over baselines and produces virtual policy evaluations that show significant correlation with real-world robot performance.

What carries the argument

2D kinematic skeleton rendering as a unified conditioning representation that carries action information across robot arms and human hands

Load-bearing premise

The 2D kinematic skeleton rendering supplies enough precise action detail to work equally well for every robot arm and human hand without embodiment-specific loss of fidelity.

What would settle it

A collection of robot policies whose performance rankings or scores in OSCAR-generated videos differ substantially from the rankings obtained when the same policies are run on physical robots.

Figures

Figures reproduced from arXiv: 2606.04463 by Jun Gao, Zhuoyuan Wu.

**Figure 1.** Figure 1: OSCAR as a real-world policy-evaluation proxy on RoboArena [1]. Left: Comparison between a OSCAR rollout (top) and the corresponding real-world rollout (bottom) for the π0-FAST [2, 3] policy; three frames sampled uniformly over the episode. Right: Mean success rates on RoboArena across seven generalist policies: evaluating on our world model exhibits a strong correlation with real-world evaluation. Abstrac… view at source ↗

**Figure 2.** Figure 2: Method overview. OSCAR consists of three components: (1) Condition encoding encodes the first frame I0 and rendered skeleton S1:T into latents using VAE; (2) Conditioning injection combines the skeleton latent with the noisy video latent; and (3) Video generation, where a DiT denoises the tokens and a VAE decoder decodes the final video. at the pixel level, including pointmap renderings (Kinema4D [4]), occ… view at source ↗

**Figure 3.** Figure 3: Skeleton overlays at video frames for the eight training sources. Each block shows four episodes from one source. Top row: DROID, RH20T-cfg5, RH20T-cfg7, InternData (four robot recordings). Bottom row: AgiBot G1, AIROA-MoMa, EgoDex, EPIC-Kitchens (humanoid and two human MANO sources). follow the given action, especially when the target motion differs from the training distribution. On the other hand, more … view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of action-conditioned video generation on two embodiments. Compared with five baselines, our method achieved much better visual quality with precise action following. (i) Text-only control [23, 6, 8] is weakest because language can not precisely describe the action sequence. (ii) Latent-action guidance [11, 15] is limited by the training embodiments: a fixed kinematic layout becomes … view at source ↗

**Figure 5.** Figure 5: Qualitative comparison on the four remaining robot embodiments. Embodiment colours [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Additional qualitative samples for AgiBot G1 and DROID, complementing Figure [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Conditioning-channel ablation (Table [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Data-composition ablation (Table [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Human-data qualitative samples. Each panel stacks GT (top) and [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Additional human-data qualitative samples, complementing Figure [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

**Figure 11.** Figure 11: Put the food on the plate. Each strip pairs the OSCAR rollout (top) with the real-robot video (bottom) at six time steps. The real column is the RoboArena human success label and the WM column is the VLM verdict on the rollout (§5.4); cells where the two differ are highlighted. On this task 5 of seven policies succeed on the real robot. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Press a button on the phone. Each strip pairs the OSCAR rollout (top) with the real-robot video (bottom) at six time steps. The real column is the RoboArena human success label and the WM column is the VLM verdict on the rollout (§5.4); cells where the two differ are highlighted. On this task 2 of seven policies succeed on the real robot. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

**Figure 13.** Figure 13: Move the bread to the plate. Each strip pairs the OSCAR rollout (top) with the real-robot video (bottom) at six time steps. The real column is the RoboArena human success label and the WM column is the VLM verdict on the rollout (§5.4); cells where the two differ are highlighted. On this task 1 of seven policies succeeds on the real robot. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗

read the original abstract

We present OSCAR, a precise action-conditioned video world model that generalizes across different robot embodiments and enables robot policy evaluation. Existing video world models face three main challenges for real-world robot evaluation: limited scenario diversity in current robot training datasets, imprecise action following, and poor generalization across embodiments for broad adoption. We tackle these challenges from two perspectives. At its core is a large-scale standardized data pipeline that curates, filters, and deduplicates broad robotics and egocentric human datasets, yielding a clean joint-training dataset that spans diverse tasks, scenarios, actions, and robot embodiments. To condition the video model, we adopt 2D kinematic skeleton rendering as a unified conditioning representation that generalizes across different robot arms or even human hands. We finetune the Cosmos-Predict2.5-2B model on a single GH200 GPU. Our model achieves significant improvement on action following, appearance quality, and motion consistency, compared to existing baselines, which either have a much larger model size or require more GPUs. We further deploy OSCAR to evaluate robot policies from RoboArena. Extensive experiments demonstrate the significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation, paving the way for the future where robot policies can be purely evaluated in virtual generated worlds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OSCAR fine-tunes Cosmos on deduplicated multi-embodiment data with 2D skeleton conditioning and claims virtual-real policy correlation, but the abstract supplies no metrics to judge either the gains or the correlation strength.

read the letter

The paper's main contribution is a data pipeline that pulls together robotics and egocentric human videos, deduplicates them, and trains a 2B video model on the result using 2D kinematic skeletons as the action-conditioning signal. They then run RoboArena policies through the model to generate rollouts and report that those virtual scores line up with real-world scores.

What is actually new is the combination of that curation step with direct use for policy ranking on an existing benchmark, plus the choice to start from Cosmos-Predict2.5 rather than training from scratch. The single-GPU fine-tune is also a practical detail that lowers the barrier compared with larger video-model efforts.

The soft spot is the complete absence of numbers. The abstract says the model improves action following and motion consistency over baselines and that the virtual-real correlation is significant, yet it gives no PSNR, FVD, correlation coefficient, dataset sizes, or statistical test. Without those, it is impossible to tell whether the skeleton conditioning actually preserves enough 3D action information across arms and hands or whether the correlation is strong enough to matter for real deployment.

The stress-test worry about 2D projection losing depth and rotation cues is therefore still open; the paper would need to show that the correlation survives on embodiments whose kinematics differ in 3D structure.

This is for people working on scalable robot evaluation who already follow video world models. A reader who wants concrete evidence that virtual testing can replace physical runs will find the current write-up preliminary.

It deserves peer review because the problem is concrete and the pipeline is reproducible in principle, but any referee will ask for the missing quantitative results and ablations on the conditioning choice.

Referee Report

2 major / 1 minor

Summary. The paper presents OSCAR, an action-conditioned video world model for robotics that uses 2D kinematic skeleton rendering as a unified conditioning signal across robot arms and human hands. It describes a large-scale data curation pipeline combining robotics and egocentric human datasets, finetuning of the Cosmos-Predict2.5-2B model on a single GH200 GPU, claimed improvements in action following/appearance/motion consistency over baselines, and a significant correlation between virtual policy evaluations (on RoboArena policies) and real-world results.

Significance. If the reported virtual-real correlation is supported by quantitative evidence, the work could enable lower-cost policy evaluation by shifting testing into generated worlds. The standardized data pipeline and embodiment-agnostic conditioning approach address practical barriers in current video world models, though the lack of reported metrics prevents gauging the magnitude of these advances.

major comments (2)

[Abstract] Abstract: the claims of 'significant improvement on action following' and 'significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation' are presented without any quantitative metrics, baseline comparisons, dataset sizes, statistical details, or evaluation protocol; this absence makes the central data-to-claim link unevaluable from the manuscript.
[Method] Method (conditioning representation): the assertion that 2D kinematic skeleton rendering 'generalizes across different robot arms or even human hands' and preserves 'precise action information' is load-bearing for both the cross-embodiment claim and the virtual-real correlation; the manuscript does not address or test whether projection from 3D joint configurations to 2D discards depth/out-of-plane cues that would degrade fidelity for kinematically dissimilar embodiments.

minor comments (1)

[Abstract] Abstract: model size (2B) and training hardware (single GH200) are stated, but no corresponding numbers are given for the baselines against which improvements are claimed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments point-by-point below, clarifying where quantitative details appear in the manuscript and acknowledging where additional discussion is warranted.

read point-by-point responses

Referee: [Abstract] Abstract: the claims of 'significant improvement on action following' and 'significant correlation between our virtual policy evaluation in OSCAR and real-world evaluation' are presented without any quantitative metrics, baseline comparisons, dataset sizes, statistical details, or evaluation protocol; this absence makes the central data-to-claim link unevaluable from the manuscript.

Authors: The abstract is intentionally concise and high-level. Quantitative results—including specific action-following metrics versus baselines, the reported correlation coefficient between virtual and real policy evaluations, dataset sizes after deduplication, and the full evaluation protocol—are provided in the Experiments and Results sections. To make the abstract self-contained, we will revise it to include the key numerical highlights (e.g., correlation value and relative improvements) while preserving brevity. revision: yes
Referee: [Method] Method (conditioning representation): the assertion that 2D kinematic skeleton rendering 'generalizes across different robot arms or even human hands' and preserves 'precise action information' is load-bearing for both the cross-embodiment claim and the virtual-real correlation; the manuscript does not address or test whether projection from 3D joint configurations to 2D discards depth/out-of-plane cues that would degrade fidelity for kinematically dissimilar embodiments.

Authors: The 2D skeleton rendering was selected precisely because it yields a compact, embodiment-agnostic signal that can be generated from any 3D joint set. The manuscript demonstrates cross-embodiment generalization through training and evaluation on multiple robot arms plus human-hand data. However, we agree that an explicit analysis of information loss from 3D-to-2D projection (depth/out-of-plane cues) is not present. We will add a dedicated paragraph discussing this potential limitation and its implications for kinematically dissimilar embodiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity: central claim is empirical correlation with external real-world data

full rationale

The paper presents OSCAR as a finetuned video world model using 2D kinematic skeleton rendering for cross-embodiment conditioning, with the load-bearing claim being an observed correlation between its virtual policy evaluations (on RoboArena policies) and separate real-world evaluations. This correlation is reported as an experimental outcome resting on external measurements rather than any derivation, equation, or fitted quantity internal to the model. No self-citations, uniqueness theorems, ansatzes, or renamings are invoked in a load-bearing way; the conditioning choice is stated as an adoption without reducing the correlation result to a self-definition. The chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, mathematical axioms, or newly postulated entities are stated. The approach depends on the pre-existing Cosmos-Predict2.5-2B model and on the assumption that curated public datasets can be combined without introducing unstated biases.

pith-pipeline@v0.9.1-grok · 5755 in / 1148 out tokens · 41663 ms · 2026-06-28T06:32:50.911487+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 20 canonical work pages · 12 internal anchors

[1]

Atreya, K

P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Neary, E. S. Hu, K. Arora, K. Ellis, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. InConference on Robot Learning, pages 336–364. PMLR, 2025

2025
[2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

M. Xu, T. Zhang, T. Liu, Z. Chen, X. Han, and Z. Liu. Kinema4D: Kinematic 4D world modeling for spatiotemporal embodied simulation.arXiv preprint arXiv:2603.16669, 2026

work page arXiv 2026
[5]

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, and other. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. InAdvances in neural information processing systems, volume 36, pages 9156–9172, 2023

2023
[8]

S. Yang, Y . Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations, 2024

2024
[9]

H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, 2024

2024
[10]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InInternational Conference on Machine Learning, pages 24328–24346. PMLR, 2025

2025
[11]

F. Zhu, H. Wu, S. Guo, Y . Liu, C. Cheang, and T. Kong. Irasim: A fine-grained world model for robot manipulation.Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025

2025
[12]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. InICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026

2026
[13]

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.Conference on Robot Learning, pages 5170–5194, 2025

2025
[14]

Jiang, S

Y . Jiang, S. Chen, S. Huang, L. Chen, P. Zhou, Y . Liao, X. HE, C. Liu, H. Li, M. Yao, et al. Enerverse-ac: Envisioning embodied environments with action condition.NeurIPS 2025 Workshop on Embodied World Models for Decision Making, 2025

2025
[15]

Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-World: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations (ICLR), 2026

2026
[16]

S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions.International Conference on Machine Learning, pages 18744–18771, 2025

2025
[17]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dream- dojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026. 9

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Bharadhwaj, R

H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

2024
[20]

X. Yang, B. Li, S. Xu, N. Wang, C. Ye, Z. Chen, M. Qin, Y . Du, X. Jin, H. Zhao, and H. Zhao. ORV: 4D occupancy-centric robot video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[21]

Y . Wang, C. Wen, H. Guo, S. Peng, M. Qin, H. Bao, X. Zhou, and R. Hu. Precise action-to-video generation through visual action prompts.Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12713–12724, 2025

2025
[22]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y . Du, and C. Gan. TesserAct: Learning 4D embodied world models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025
[24]

X. Li, K. Hsu, J. Gu, O. Mees, K. Pertsch, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. InConference on Robot Learning, pages 3705–3728. PMLR, 2025

2025
[25]

Y . Li, Y . Zhu, J. Wen, C. Shen, and Y . Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025

work page arXiv 2025
[26]

Quevedo, A

J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

work page arXiv 2025
[27]

Tseng, J

W.-C. Tseng, J. Gu, Q. Zhang, H. Mao, M.-Y . Liu, F. Shkurti, and L. Yen-Chen. Scalable policy evaluation with video world models.arXiv preprint arXiv:2511.11520, 2025

work page arXiv 2025
[28]

H. Yue, S. Huang, Y . Liao, S. Chen, P. Zhou, L. Chen, M. Yao, and G. Ren. Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models.arXiv preprint arXiv:2505.09694, 2025

work page arXiv 2025
[29]

Romero, D

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):245:1–245:17, 2017

2017
[30]

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. AgiBot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot.RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

2023
[32]

Y . Tian, Y . Yang, Y . Xie, Z. Cai, X. Shi, N. Gao, H. Liu, X. Jiang, Z. Qiu, F. Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.arXiv preprint arXiv:2511.16651, 2025

work page arXiv 2025
[33]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Takanami, P

R. Takanami, P. Khrapchenkov, S. Morikuni, J. Arima, Y . Takaba, S. Maeda, T. Okubo, G. Sano, S. Sekioka, A. Kadoya, et al. Airoa moma dataset: A large-scale hierarchical dataset for mobile manipulation.arXiv preprint arXiv:2509.25032, 2025

work page arXiv 2025
[35]

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. arXiv preprint arXiv:2510.21571, 2025

work page arXiv 2025
[37]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 10

2020
[38]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

2023
[39]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Horé and D

A. Horé and D. Ziou. Image quality metrics: Psnr vs. ssim. In2010 20th International Conference on Pattern Recognition, pages 2366–2369, 2010

2010
[41]

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

2004
[42]

Zhang, P

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

2018
[43]

Towards Accurate Generative Models of Video: A New Metric & Challenges

T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[44]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017
[45]

R. Wang, S. Xu, Y . Dong, Y . Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details.Advances in Neural Information Processing Systems, 38:35928–35959, 2026

2026
[46]

J. Lu, Z. Liang, T. Xie, F. Richter, S. Lin, S. Liu, and M. C. Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1914–1920. IEEE, 2025

1914
[47]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

2023
[48]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2018

2018
[49]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance.NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2022

2021
[50]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022. 11 A Technical Appendix A.1 Data curation pipeline A.1.1 Datase...

2022
[51]

You need to describe the main subject of the video in detail, including their appearance, actions, expressions, and surrounding environment
[52]

You need to emphasize movement information in the input and different camera angles
[53]

Your output should convey natural movement attributes, incorporating natural actions related to the described subject category, using simple and direct verbs as much as possible
[54]

You should reference the detailed information in the video, such as character actions, clothing, backgrounds, and emphasize the details in the video
[55]

Control the rewritten prompt to around 80–100 words
[56]

12 Example of the rewritten English prompt:

No matter what language the user inputs, you must always output in English. 12 Example of the rewritten English prompt:
[57]

The girl wears a white square collar puff sleeve dress, decorated with pleats and buttons

A Japanese fresh film-style photo of a young East Asian girl with double braids sitting by the boat. The girl wears a white square collar puff sleeve dress, decorated with pleats and buttons. She has fair skin, delicate features, and slightly melancholic eyes, staring directly at the camera. Her hair falls naturally, with bangs covering part of her forehe...
[58]

She has long dark purple hair and red eyes, wearing a dark gray skirt and a light gray top with a white waist tie and a name tag in bold Chinese characters that says “Ziyang”

An anime illustration in vibrant thick painting style of a white girl with cat ears holding a folder, showing a slightly dissatisfied expression. She has long dark purple hair and red eyes, wearing a dark gray skirt and a light gray top with a white waist tie and a name tag in bold Chinese characters that says “Ziyang”. The background has a light yellow i...
[59]

The crocodile’s skin is rough and grayish-white, resembling stone or wood texture

CG game concept digital art featuring a huge crocodile with its mouth wide open, with trees and thorns growing on its back. The crocodile’s skin is rough and grayish-white, resembling stone or wood texture. Its back is lush with trees, shrubs, and thorny protrusions. With its mouth agape, the crocodile reveals a pink tongue and sharp teeth. The background...
[60]

Breaking Bad

In the style of an American drama promotional poster, Walter White sits in a metal folding chair wearing a yellow protective suit, with the words “Breaking Bad” written in sans-serif English above him, surrounded by piles of dollar bills and blue plastic storage boxes. He wears glasses, staring forward, dressed in a yellow jumpsuit, with his hands resting...

2048

[1] [1]

Atreya, K

P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Neary, E. S. Hu, K. Arora, K. Ellis, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. InConference on Robot Learning, pages 336–364. PMLR, 2025

2025

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

M. Xu, T. Zhang, T. Liu, Z. Chen, X. Han, and Z. Liu. Kinema4D: Kinematic 4D world modeling for spatiotemporal embodied simulation.arXiv preprint arXiv:2603.16669, 2026

work page arXiv 2026

[5] [5]

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, and other. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Y . Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. InAdvances in neural information processing systems, volume 36, pages 9156–9172, 2023

2023

[8] [8]

S. Yang, Y . Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations, 2024

2024

[9] [9]

H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, 2024

2024

[10] [10]

Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InInternational Conference on Machine Learning, pages 24328–24346. PMLR, 2025

2025

[11] [11]

F. Zhu, H. Wu, S. Guo, Y . Liu, C. Cheang, and T. Kong. Irasim: A fine-grained world model for robot manipulation.Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025

2025

[12] [12]

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. InICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling, 2026

2026

[13] [13]

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.Conference on Robot Learning, pages 5170–5194, 2025

2025

[14] [14]

Jiang, S

Y . Jiang, S. Chen, S. Huang, L. Chen, P. Zhou, Y . Liao, X. HE, C. Liu, H. Li, M. Yao, et al. Enerverse-ac: Envisioning embodied environments with action condition.NeurIPS 2025 Workshop on Embodied World Models for Decision Making, 2025

2025

[15] [15]

Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-World: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations (ICLR), 2026

2026

[16] [16]

S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions.International Conference on Machine Learning, pages 18744–18771, 2025

2025

[17] [17]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y . Dong, K. Mo, C.-H. Lin, et al. Dream- dojo: A generalist robot world model from large-scale human videos.arXiv preprint arXiv:2602.06949, 2026. 9

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Bharadhwaj, R

H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

2024

[20] [20]

X. Yang, B. Li, S. Xu, N. Wang, C. Ye, Z. Chen, M. Qin, Y . Du, X. Jin, H. Zhao, and H. Zhao. ORV: 4D occupancy-centric robot video generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[21] [21]

Y . Wang, C. Wen, H. Guo, S. Peng, M. Qin, H. Bao, X. Zhou, and R. Hu. Precise action-to-video generation through visual action prompts.Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12713–12724, 2025

2025

[22] [22]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y . Du, and C. Gan. TesserAct: Learning 4D embodied world models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[24] [24]

X. Li, K. Hsu, J. Gu, O. Mees, K. Pertsch, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. InConference on Robot Learning, pages 3705–3728. PMLR, 2025

2025

[25] [25]

Y . Li, Y . Zhu, J. Wen, C. Shen, and Y . Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025

work page arXiv 2025

[26] [26]

Quevedo, A

J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

work page arXiv 2025

[27] [27]

Tseng, J

W.-C. Tseng, J. Gu, Q. Zhang, H. Mao, M.-Y . Liu, F. Shkurti, and L. Yen-Chen. Scalable policy evaluation with video world models.arXiv preprint arXiv:2511.11520, 2025

work page arXiv 2025

[28] [28]

H. Yue, S. Huang, Y . Liao, S. Chen, P. Zhou, L. Chen, M. Yao, and G. Ren. Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models.arXiv preprint arXiv:2505.09694, 2025

work page arXiv 2025

[29] [29]

Romero, D

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):245:1–245:17, 2017

2017

[30] [30]

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang, et al. AgiBot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, J. Wang, H. Zhu, and C. Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot.RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

2023

[32] [32]

Y . Tian, Y . Yang, Y . Xie, Z. Cai, X. Shi, N. Gao, H. Liu, X. Jiang, Z. Qiu, F. Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.arXiv preprint arXiv:2511.16651, 2025

work page arXiv 2025

[33] [33]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Takanami, P

R. Takanami, P. Khrapchenkov, S. Morikuni, J. Arima, Y . Takaba, S. Maeda, T. Okubo, G. Sano, S. Sekioka, A. Kadoya, et al. Airoa moma dataset: A large-scale hierarchical dataset for mobile manipulation.arXiv preprint arXiv:2509.25032, 2025

work page arXiv 2025

[35] [35]

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

Q. Li, Y . Deng, Y . Liang, L. Luo, L. Zhou, C. Yao, L. Zeng, Z. Feng, H. Liang, S. Xu, et al. Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. arXiv preprint arXiv:2510.21571, 2025

work page arXiv 2025

[37] [37]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, 2020. 10

2020

[38] [38]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

2023

[39] [39]

S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

Horé and D

A. Horé and D. Ziou. Image quality metrics: Psnr vs. ssim. In2010 20th International Conference on Pattern Recognition, pages 2366–2369, 2010

2010

[41] [41]

Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004

2004

[42] [42]

Zhang, P

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

2018

[43] [43]

Towards Accurate Generative Models of Video: A New Metric & Challenges

T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[44] [44]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017

[45] [45]

R. Wang, S. Xu, Y . Dong, Y . Deng, J. Xiang, Z. Lv, G. Sun, X. Tong, and J. Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details.Advances in Neural Information Processing Systems, 38:35928–35959, 2026

2026

[46] [46]

J. Lu, Z. Liang, T. Xie, F. Richter, S. Lin, S. Liu, and M. C. Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1914–1920. IEEE, 2025

1914

[47] [47]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

2023

[48] [48]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2018

2018

[49] [49]

Ho and T

J. Ho and T. Salimans. Classifier-free diffusion guidance.NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2022

2021

[50] [50]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022. 11 A Technical Appendix A.1 Data curation pipeline A.1.1 Datase...

2022

[51] [51]

You need to describe the main subject of the video in detail, including their appearance, actions, expressions, and surrounding environment

[52] [52]

You need to emphasize movement information in the input and different camera angles

[53] [53]

Your output should convey natural movement attributes, incorporating natural actions related to the described subject category, using simple and direct verbs as much as possible

[54] [54]

You should reference the detailed information in the video, such as character actions, clothing, backgrounds, and emphasize the details in the video

[55] [55]

Control the rewritten prompt to around 80–100 words

[56] [56]

12 Example of the rewritten English prompt:

No matter what language the user inputs, you must always output in English. 12 Example of the rewritten English prompt:

[57] [57]

The girl wears a white square collar puff sleeve dress, decorated with pleats and buttons

A Japanese fresh film-style photo of a young East Asian girl with double braids sitting by the boat. The girl wears a white square collar puff sleeve dress, decorated with pleats and buttons. She has fair skin, delicate features, and slightly melancholic eyes, staring directly at the camera. Her hair falls naturally, with bangs covering part of her forehe...

[58] [58]

She has long dark purple hair and red eyes, wearing a dark gray skirt and a light gray top with a white waist tie and a name tag in bold Chinese characters that says “Ziyang”

An anime illustration in vibrant thick painting style of a white girl with cat ears holding a folder, showing a slightly dissatisfied expression. She has long dark purple hair and red eyes, wearing a dark gray skirt and a light gray top with a white waist tie and a name tag in bold Chinese characters that says “Ziyang”. The background has a light yellow i...

[59] [59]

The crocodile’s skin is rough and grayish-white, resembling stone or wood texture

CG game concept digital art featuring a huge crocodile with its mouth wide open, with trees and thorns growing on its back. The crocodile’s skin is rough and grayish-white, resembling stone or wood texture. Its back is lush with trees, shrubs, and thorny protrusions. With its mouth agape, the crocodile reveals a pink tongue and sharp teeth. The background...

[60] [60]

Breaking Bad

In the style of an American drama promotional poster, Walter White sits in a metal folding chair wearing a yellow protective suit, with the words “Breaking Bad” written in sans-serif English above him, surrounded by piles of dollar bills and blue plastic storage boxes. He wears glasses, staring forward, dressed in a yellow jumpsuit, with his hands resting...

2048