SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Allen Z. Ren; Florian Shkurti; Gashon Hussein; Jinwei Gu; Lucy X. Shi; Ming-Yu Liu; Quan Vuong; Sergey Levine; Wei-Cheng Tseng; Xudong Wang

arxiv: 2606.18610 · v2 · pith:GAY7ILQOnew · submitted 2026-06-17 · 💻 cs.RO · cs.CV

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

Pith reviewed 2026-06-26 21:15 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: robot policy evaluation · video world models · self-consistent generation · vision-language-action policies · forward-inverse dynamics · cross-view consistency · simulation-based evaluation

The pith

Enforcing forward-inverse dynamics, cross-view inpainting, and test-time termination turns a video foundation model into an accurate evaluator of robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating robot manipulation policies in the real world is expensive and hard to scale, so the paper turns video world models into simulators for policy rollouts. Three problems stand in the way: autoregressive video generation drifts over time, multi-view observations become inconsistent, and models struggle with policies outside their training data. SC3-Eval counters these issues by adapting a pre-trained video model through three consistency constraints: joint forward-inverse dynamics training to anchor actions to a plausible manifold, cross-view inpainting to maintain coherent observations without memory, and test-time termination using inverse dynamics uncertainty to halt drifting rollouts. The resulting simulations reproduce real-world failure modes and achieve a closed-loop Pearson correlation of 0.929 and MMRV of 0.119 across seven vision-language-action policies, outperforming prior video baselines while generalizing to new tasks.

Core claim

SC3-Eval adapts a pre-trained video foundation model into a policy evaluator by jointly training forward and inverse dynamics to keep generated rollouts on a physically plausible action manifold, training cross-view inpainting to maintain multi-camera coherence over long horizons without explicit memory, and reusing inverse dynamics at test time as an uncertainty signal to terminate inconsistent rollouts. This self-consistent recipe produces evaluations that match real-world policy performance with a closed-loop Pearson correlation of 0.929 and MMRV of 0.119 across seven policies, reproduce specific failure modes, outperform three prior video-model baselines, and generalize to new tasks.
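The two headline metrics can be made concrete with a short Python sketch that scores an evaluator from paired success rates. This page does not define MMRV, so the sketch assumes the Mean Maximum Rank Violation definition from earlier sim-to-real evaluation work [11]: for each policy, take the largest real-world performance gap among the policies the evaluator ranks in the wrong order relative to it, then average over policies. The function names and the sample numbers are hypothetical and are not values from the paper.

```python
import numpy as np

def pearson_r(sim, real):
    """Pearson correlation between predicted and real success rates."""
    sim = np.asarray(sim, dtype=float)
    real = np.asarray(real, dtype=float)
    return float(np.corrcoef(sim, real)[0, 1])

def mmrv(sim, real):
    """Mean Maximum Rank Violation (assumed definition, see lead-in).

    A pair (i, j) is violated when the evaluator orders the two policies
    differently from the real world; the penalty is the real-world gap.
    """
    sim = np.asarray(sim, dtype=float)
    real = np.asarray(real, dtype=float)
    real_gap = np.abs(real[:, None] - real[None, :])
    sim_less = sim[:, None] < sim[None, :]
    real_less = real[:, None] < real[None, :]
    violated = sim_less != real_less
    # Worst violation per policy, averaged over all policies.
    return float((real_gap * violated).max(axis=1).mean())

# Hypothetical success rates for three policies (not from the paper).
sim_rates = [0.2, 0.5, 0.9]
real_rates = [0.1, 0.6, 0.8]
print(pearson_r(sim_rates, real_rates), mmrv(sim_rates, real_rates))
```

Under this definition a lower MMRV is better: it is zero when the evaluator preserves every pairwise ordering, and it grows with the real-world cost of each misranking, which Pearson correlation alone does not capture.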

What carries the argument

The SC3-Eval self-consistent video generation recipe, which jointly trains for forward-inverse dynamics consistency and cross-view inpainting consistency, then adds test-time consistency by terminating rollouts on inverse-dynamics uncertainty.
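According to the paper's Figure 2 caption, the training modes share one backbone and differ only in which tokens are clean conditioning and which are noisy denoising targets. The Python sketch below illustrates that partition. It is not the authors' implementation: the view names, the token-group naming, the choice to noise an entire held-out view, and the linear noising rule are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical camera names; the paper uses three synchronized views.
VIEWS = ("wrist", "third_person_a", "third_person_b")

def noisy_groups(mode, kept_view=None):
    """Return the token groups treated as denoising targets in a mode.

    Groups are "<view>/first", "<view>/future", and "action".
    Every group not returned is clean conditioning.
    """
    if mode == "forward":
        # Future video is the target; first frames and action are clean.
        return {f"{v}/future" for v in VIEWS}
    if mode == "cross_view":
        if kept_view not in VIEWS:
            raise ValueError("cross_view needs a kept_view from VIEWS")
        # Assumption: the whole held-out view (first and future) is noised.
        return {f"{v}/{seg}" for v in VIEWS if v != kept_view
                for seg in ("first", "future")}
    if mode == "inverse":
        # The action chunk is the target; the full video is clean.
        return {"action"}
    raise ValueError(f"unknown mode: {mode}")

def corrupt(tokens, noisy, t, rng):
    """Noise only the target groups; pass conditioning through unchanged.

    Assumption: linear interpolation x_t = (1 - t) * x + t * eps.
    """
    out = {}
    for name, x in tokens.items():
        if name in noisy:
            eps = rng.standard_normal(x.shape)
            out[name] = (1.0 - t) * x + t * eps
        else:
            out[name] = x
    return out
```

Because the mode is encoded purely in the clean/noisy partition, a single set of weights can serve as forward model, inpainter, and inverse model, which is what lets the inverse mode be reused at test time.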

If this is right

Policy ranking and fine-grained failure diagnosis become possible from simulated video alone without physical deployment.
The evaluator works for policies whose behaviors were not seen during its own training.
Multi-view coherence holds across long action sequences without needing recurrent memory.
Rollouts remain anchored to physically plausible actions because inverse dynamics penalizes drift that forward-only models miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-consistency recipe could be applied to train better video predictors for other physical domains such as navigation or assembly.
If the correlation holds on broader policy sets, development cycles for robot foundation models could shift toward simulation-first iteration.
Extending the uncertainty signal to continuous per-frame monitoring rather than chunk termination might further reduce compounding error.
The approach suggests that explicit inverse modeling is a general lever for making generative world models more reliable for downstream control tasks.

Load-bearing premise

That jointly training forward-inverse dynamics, cross-view inpainting, and uncertainty-based termination on a pre-trained video model is enough to keep long-horizon rollouts plausible and to generalize to policies outside the training distribution.

What would settle it

Test SC3-Eval on an eighth real-world policy whose action distribution lies substantially outside the original training data and check whether the predicted success rates and failure modes still match the measured real-world execution rates within the reported margins.

Figures

Figures reproduced from arXiv: 2606.18610 by Allen Z. Ren, Florian Shkurti, Gashon Hussein, Jinwei Gu, Lucy X. Shi, Ming-Yu Liu, Quan Vuong, Sergey Levine, Wei-Cheng Tseng, Xudong Wang, Yuzhu Dong, Zhaoshuo Li.

**Figure 1.** The three consistency axes of SC3-Eval. (a) Forward-inverse dynamics consistency: under the same action sequence, Ours closely tracks GT, while a forward-only baseline drifts. (b) Multi-view consistency: the model predicts the wrist view from a third-person view, with good versus bad example predictions. (c) Test-time consistency: disagreement between commanded and inverse-mode-recovered actions terminates…

**Figure 2.** Self-consistency training for SC3-Eval. Three joint-training modes over a shared backbone, distinguished only by which tokens are clean (conditioning) versus noisy (denoising targets). Forward Dynamics denoises future video given action and first frames, Cross-View Inpainting denoises held-out views from the remaining view and action, and Inverse Dynamics denoises the action chunk given the full video. …

**Figure 3.** Real-world experiment setup. (a) Workspace, with orange arrows for forward table bussing and green arrows for the destination-swapped reverse variant. (b) The three synchronized camera views observed by the policy and the world model. (c) Example trajectories scored under the three success criteria.

**Figure 4.** Correlation between predicted and real-world policy performance. Panels: (a) offline evaluation (open-loop); (b) online evaluation (closed-loop). Each point is a (checkpoint, success-criterion) pair. Within each of (a) and (b), the left and right panels show the in-distribution (table bussing) and out-of-distribution (reverse table bussing) splits.

…policy checkpoints with the π0.5 [25] architecture are evaluated in our experiments. More details are in App. A. Evaluation protocol. For e…

**Figure 5.** Qualitative rollouts for online generation. Each example shows the shared initial observation, followed by the real-world rollout and the online rollout generated by our evaluator under the same policy.

…of-distribution split, most sharply in online mode, reflecting the genuine difficulty of generalizing across a task-semantic shift. Offline versus online. One might expect online evaluation to be less fait…

**Figure 6.** Per-category outcome reproduction rate under online evaluation. Each bar shows the fraction of trajectories whose evaluator rollout matches the real-world outcome category.

A policy evaluator that reproduces aggregate success rates is useful for checkpoint selection, but a stronger criterion is whether the policy evaluator reproduces the same kinds of failures a policy exhibits in the real world. Two ev…

**Figure 7.** Qualitative effect of inverse dynamics grounding on offline rollouts. Each row shows the same offline (open-loop) rollout under a different training configuration, sampled at five timesteps from a single bussing trajectory. The variant trained without inverse dynamics matches the ground truth for the first second, then drifts onto a different scene and never recovers, while the variant with inverse dynamic…

**Figure 8.** Qualitative effect of cross-view inpainting on wrist-view scene re-entry. Each row shows the same offline (open-loop) rollout under a different training configuration, sampled at four timesteps spanning a re-entry segment in which the wrist camera turns away from the workspace and then returns. Each panel shows a zoomed wrist view on top and the two third-person views below. The variant trained without cro…

**Figure 9.** Uncertainty-driven early termination on an example rollout. Top: four frames sampled from the imagined rollout. Bottom: the per-chunk consistency error $U_{\mathrm{chunk}}$ over the 14 rollout chunks. The rollout is terminated at the first chunk whose $U_{\mathrm{chunk}}$ exceeds the threshold $\tau = 0.02$ (chunk 10 here), before the visible drift compounds.

C.2 Prediction-Execution Horizon Decoupling. The prediction-execution horizon dec…

Original abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
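The test-time consistency check described in the abstract can be sketched in a few lines of Python: compare the action chunk that was commanded with the chunk the inverse dynamics mode recovers from the generated frames, and stop at the first chunk where the disagreement is too large. The threshold of 0.02 and the 14-chunk rollout come from the Figure 9 caption; the mean-squared-error form of the per-chunk score, the function names, and the array shapes are assumptions, since this page does not give the paper's exact formula.

```python
import numpy as np

def chunk_uncertainty(commanded, recovered):
    """Disagreement between a commanded and a recovered action chunk.

    Assumption: mean squared error over all steps and action dimensions.
    """
    commanded = np.asarray(commanded, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    return float(np.mean((commanded - recovered) ** 2))

def first_drifting_chunk(commanded_chunks, recovered_chunks, tau=0.02):
    """Return the 0-based index of the first chunk whose score exceeds tau.

    Returns None when every chunk stays at or below the threshold, in which
    case the rollout would run to completion.
    """
    pairs = zip(commanded_chunks, recovered_chunks)
    for k, (a_cmd, a_rec) in enumerate(pairs):
        if chunk_uncertainty(a_cmd, a_rec) > tau:
            return k
    return None

# Hypothetical rollout: 14 chunks of 8 steps with 7 action dimensions.
commanded = [np.zeros((8, 7)) for _ in range(14)]
recovered = ([np.full((8, 7), 0.01) for _ in range(9)]
             + [np.full((8, 7), 0.2) for _ in range(5)])
print(first_drifting_chunk(commanded, recovered))
```

In a real rollout the recovered chunks would be produced one at a time by the inverse mode, so the check can run online and stop generation as soon as a chunk fails.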

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SC3-Eval gets a 0.929 correlation and reproduces failure modes with its three consistency constraints, but the OOD generalization still looks like an assumption until the full results are checked.

The main things to know: SC3-Eval reports a 0.929 closed-loop Pearson correlation with real robot rollouts across seven vision-language-action policies and an MMRV of 0.119, reproduces the failure modes those policies show in the real world, and beats three prior video-model baselines.

The concrete new element is the joint recipe: forward-inverse dynamics training to keep generated actions on a plausible manifold, cross-view inpainting to maintain multi-camera coherence without memory, and test-time reuse of the inverse model as an uncertainty signal to stop drifting rollouts. The paper does well by turning these into a practical evaluator that claims to generalize to new tasks and by giving specific numbers instead of vague claims.

The soft spot is exactly the one the stress-test flags. The abstract does not show explicit OOD splits, trajectory-level physics checks, or quantitative evidence that the three constraints actually stop drift once a policy moves outside the video model's training distribution. If the high correlation mostly reflects policies that stay close to the data the model already saw, the central scaling argument weakens. The full paper needs to make that distinction clear.

This is for people who build or benchmark robot foundation models and need cheaper ways to test policies at scale. A reader working on evaluation methods or world models for robotics would get direct value from the empirical setup and the failure-mode reproduction.

It deserves a serious referee because the problem is real, the numbers are specific, and the recipe is testable even if the OOD part requires more evidence.

Referee Report

1 major / 2 minor

Summary. The paper claims that SC3-Eval adapts a pre-trained video foundation model into a policy evaluator by jointly enforcing forward-inverse dynamics consistency (to anchor to a physically plausible action manifold), cross-view consistency (via inpainting for multi-camera coherence), and test-time consistency (inverse-dynamics-based termination to detect drift). Across seven real-world vision-language-action policies it reports a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three video-model baselines, reproducing real-world failure modes, and generalizing to new tasks.

Significance. If the reported correlation and outperformance hold under scrutiny, the work would be a meaningful contribution to scalable robot policy evaluation, offering a concrete alternative to costly real-world rollouts. The empirical numbers, the reproduction of failure modes for diagnostics, and the use of complementary consistency objectives on an existing video model are strengths that could influence how video world models are applied in robotics.

major comments (1)

[Abstract and §4–5 (Experiments)] The central claim that the three consistency mechanisms suffice to keep long-horizon autoregressive rollouts physically plausible and to generalize to policies outside the training distribution is load-bearing for the 0.929 closed-loop Pearson correlation. The abstract asserts that forward-inverse training, cross-view inpainting, and test-time termination counteract drift and compounding errors, yet no quantitative support (trajectory-level physics violation rates, explicit OOD policy splits, or ablation isolating drift on held-out behaviors) is referenced; without such evidence the correlation could reflect in-distribution overlap rather than the claimed evaluator properties.

minor comments (2)

[Abstract] The exact definition and computation of MMRV (reported as 0.119) are not stated in the abstract and should be given explicitly, with the formula and its relation to the Pearson metric.
[§5] Details on policy selection, baseline hyper-parameter tuning, and whether the seven policies or the held-out real-world data were chosen after inspecting results should be added to §5 to allow assessment of post-hoc bias.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for acknowledging the potential contribution of SC3-Eval to scalable policy evaluation. We address the single major comment below.

Referee: [Abstract and §4–5 (Experiments)] The central claim that the three consistency mechanisms suffice to keep long-horizon autoregressive rollouts physically plausible and to generalize to policies outside the training distribution is load-bearing for the 0.929 closed-loop Pearson correlation. The abstract asserts that forward-inverse training, cross-view inpainting, and test-time termination counteract drift and compounding errors, yet no quantitative support (trajectory-level physics violation rates, explicit OOD policy splits, or ablation isolating drift on held-out behaviors) is referenced; without such evidence the correlation could reflect in-distribution overlap rather than the claimed evaluator properties.

Authors: We agree that the manuscript would be strengthened by explicit quantitative support such as trajectory-level physics violation rates, dedicated OOD policy splits, and an ablation isolating drift on held-out behaviors. The current evidence consists of the closed-loop Pearson correlation of 0.929 measured across seven real-world VLA policies (including generalization to new tasks) together with qualitative reproduction of real-world failure modes. While we believe these results are inconsistent with pure in-distribution overlap, we acknowledge they do not directly quantify per-trajectory physical plausibility or isolate each consistency term's effect on drift. In the revised manuscript we will add an ablation that measures the contribution of each consistency mechanism to rollout stability on held-out behaviors and will report any available quantitative proxies for physical violation derived from the inverse-dynamics consistency signal.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical correlation against held-out real-world data

The paper reports closed-loop Pearson correlation (0.929) and MMRV (0.119) computed directly against real-world rollouts of seven held-out vision-language-action policies. The training recipe (forward-inverse dynamics, cross-view inpainting, test-time termination) is described as a method to produce the evaluator, but the final metrics are external benchmarks that do not reduce by construction to any fitted parameter or self-citation chain within the paper. No equations, uniqueness theorems, or ansatzes are shown to force the reported numbers from the training data alone. The evaluation therefore remains self-contained against independent real-world measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond reliance on a pre-trained video foundation model and standard supervised training; the three consistency objectives are presented as training signals rather than new postulated objects.

pith-pipeline@v0.9.1-grok · 5856 in / 1157 out tokens · 22124 ms · 2026-06-26T21:15:37.656223+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 1 canonical work pages

[1] X. Yang, R. Dagli, A. Zook, H. Hadfield, A. Goyal, S. Birchfield, F. Ramos, and J. Tremblay. Robolab: A high-fidelity simulation benchmark for analysis of task generalist policies, 2026. URL https://arxiv.org/abs/2604.09860

[2] A. Jain, M. Zhang, K. Arora, W. Chen, M. Torne, M. Z. Irshad, S. Zakharov, Y. Wang, S. Levine, C. Finn, W.-C. Ma, D. Shah, A. Gupta, and K. Pertsch. Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025. URL https://arxiv.org/abs/2512.16881

[3] F. Zhu, H. Wu, S. Guo, Y. Liu, C. Cheang, and T. Kong. Irasim: A fine-grained world model for robot manipulation. In ICCV, 2025. arXiv:2406.14540

[4] J. Quevedo, A. K. Sharma, Y. Sun, V. Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025. doi: 10.48550/arXiv.2506.00613. URL https://arxiv.org/abs/2506.00613

[5] W.-C. Tseng, J. Gu, Q. Zhang, H. Mao, M.-Y. Liu, F. Shkurti, and Y.-C. Lin. Scalable policy evaluation with video world models. arXiv preprint arXiv:2511.11520, 2025. URL https://arxiv.org/abs/2511.11520

[6] Y. Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025. URL https://arxiv.org/abs/2510.10125

[7] S. Li, Y. Gao, D. Sadigh, and S. Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025. URL https://arxiv.org/abs/2503.00200

[8] C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792, 2025. URL https://arxiv.org/abs/2504.02792

[9] NVIDIA, A. Ali, J. Bai, M. Bala, Y. Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y.-W. Chao, P. Chattopadhyay, M. Chen, Y. Chen, Y. Chen, S. Cheng, Y. Cui, J. Diamond, Y. Ding, J. Fan, L. Fan, L. Feng, F. Ferroni, S. Fidler, X. Fu, R. Gao, Y. Ge, J. Gu, A. Gupta, S. Gururani, I. El Hanafi, A. Hassani, Z. Hao, J. Huffman, J. Jang, P. Jannaty...

[10] P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Proceedings of the Conference on Robot Learning (CoRL 2025), 2025

[11] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024

[12] Gemini Robotics Team. Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675, 2025. URL https://arxiv.org/abs/2512.10675

[13] Y. Li, Z. Zhou, Y. Chen, Y. Xue, and Y. Zhu. dworldeval: Scalable robotic policy evaluation via discrete diffusion world model. arXiv preprint arXiv:2604.22152, 2026. URL https://arxiv.org/abs/2604.22152

[14] Y. Wang, R. Syed, F. Wu, M. Zhang, A. Onol, J. Barreiros, H. Nayyeri, T. Dear, H. Zhang, and Y. Li. Interactive world simulator for robot policy training and evaluation. arXiv preprint arXiv:2603.08546, 2026. URL https://arxiv.org/abs/2603.08546

[15] M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=wPEIStHxYH

[16] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. J. Fan, and J. Jang. World action m...

[17] Y. Liu, F. Feng, L. Kong, W. Lu, J. Tang, K. Zhang, K. Murphy, C. Finn, and Y. Du. World action verifier: Self-improving world models via forward-inverse asymmetry. arXiv preprint arXiv:2604.01985, 2026. URL https://arxiv.org/abs/2604.01985

[18] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (NeurIPS), 2018. URL https://arxiv.org/abs/1805.12114

[19] Z. Mei, T. Yin, M. Baker, O. Shorinwa, and A. Majumdar. World models that know when they don’t know: Controllable video generation with calibrated uncertainty. arXiv preprint arXiv:2512.05927, 2025. URL https://arxiv.org/abs/2512.05927

[20] R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. MOReL: Model-based offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2005.05951

[21] T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma. MOPO: Model-based offline policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2020. URL https://arxiv.org/abs/2005.13239

[22] NVIDIA et al. Cosmos 3: Omnimodal world models for physical ai. arXiv preprint arXiv:2606.02800, 2026. URL https://arxiv.org/abs/2606.02800

[23] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.02747

[24] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2209.03003

[25] Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...

[26] Z. Xiao, Y. Lan, Y. Zhou, W. Ouyang, S. Yang, Y. Zeng, and X. Pan. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025.
