pith. sign in

arxiv: 2606.18610 · v2 · pith:GAY7ILQOnew · submitted 2026-06-17 · 💻 cs.RO · cs.CV

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Pith reviewed 2026-06-26 21:15 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords robot policy evaluationvideo world modelsself-consistent generationvision-language-action policiesforward-inverse dynamicscross-view consistencysimulation-based evaluation
0
0 comments X

The pith

Enforcing forward-inverse dynamics, cross-view inpainting, and test-time termination turns a video foundation model into an accurate evaluator of robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating robot manipulation policies in the real world is expensive and hard to scale, so the paper turns video world models into simulators for policy rollouts. Autoregressive video generation drifts over time, multi-view observations become inconsistent, and models struggle with policies outside their training data. SC3-Eval counters these issues by adapting a pre-trained video model through three consistency constraints: joint forward-inverse dynamics training to anchor actions to a plausible manifold, cross-view inpainting to maintain coherent observations without memory, and test-time termination using inverse dynamics uncertainty to halt drifting rollouts. The resulting simulations reproduce real-world failure modes and achieve a closed-loop Pearson correlation of 0.929 and MMRV of 0.119 across seven vision-language-action policies, outperforming prior video baselines while generalizing to new tasks.

Core claim

SC3-Eval adapts a pre-trained video foundation model into a policy evaluator by jointly training forward and inverse dynamics to keep generated rollouts on a physically plausible action manifold, training cross-view inpainting to maintain multi-camera coherence over long horizons without explicit memory, and reusing inverse dynamics at test time as an uncertainty signal to terminate inconsistent rollouts. This self-consistent recipe produces evaluations that match real-world policy performance with a closed-loop Pearson correlation of 0.929 and MMRV of 0.119 across seven policies, reproduce specific failure modes, outperform three prior video-model baselines, and generalize to new tasks.

What carries the argument

SC3-Eval self-consistent video generation recipe that jointly trains forward-inverse dynamics consistency, cross-view inpainting consistency, and test-time consistency via inverse-dynamics uncertainty termination.

If this is right

  • Policy ranking and fine-grained failure diagnosis become possible from simulated video alone without physical deployment.
  • The evaluator works for policies whose behaviors were not seen during its own training.
  • Multi-view coherence holds across long action sequences without needing recurrent memory.
  • Rollouts remain anchored to physically plausible actions because inverse dynamics penalizes drift that forward-only models miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-consistency recipe could be applied to train better video predictors for other physical domains such as navigation or assembly.
  • If the correlation holds on broader policy sets, development cycles for robot foundation models could shift toward simulation-first iteration.
  • Extending the uncertainty signal to continuous per-frame monitoring rather than chunk termination might further reduce compounding error.
  • The approach suggests that explicit inverse modeling is a general lever for making generative world models more reliable for downstream control tasks.

Load-bearing premise

That jointly training forward-inverse dynamics, cross-view inpainting, and uncertainty-based termination on a pre-trained video model is enough to keep long-horizon rollouts plausible and to generalize to policies outside the training distribution.

What would settle it

Test SC3-Eval on an eighth real-world policy whose action distribution lies substantially outside the original training data and check whether the predicted success rates and failure modes still match the measured real-world execution rates within the reported margins.

Figures

Figures reproduced from arXiv: 2606.18610 by Allen Z. Ren, Florian Shkurti, Gashon Hussein, Jinwei Gu, Lucy X. Shi, Ming-Yu Liu, Quan Vuong, Sergey Levine, Wei-Cheng Tseng, Xudong Wang, Yuzhu Dong, Zhaoshuo Li.

Figure 1
Figure 1. Figure 1: The three consistency axes of SC3-Eval. (a) forward-inverse dynamics consistency. Under the same action sequence, Ours closely tracks GT, while a forward-only baseline drifts. (b) multi-view consistency. The model predicts the wrist view from a third-person view, with good versus bad example predictions. (c) test-time consistency. Disagreement between commanded and inverse-mode-recovered actions terminates… view at source ↗
Figure 2
Figure 2. Figure 2: Self-consistency training for SC3- Eval. Three joint-training modes over a shared backbone, distinguished only by which tokens are clean (conditioning) versus noisy (denoising tar￾gets). Forward Dynamics denoises future video given action and first frames, Cross-View Inpaint￾ing denoises held-out views from the remaining view and action, and Inverse Dynamics denoises the action chunk given the full video. … view at source ↗
Figure 3
Figure 3. Figure 3: Real-world experiment setup. (a) Workspace, with orange arrows for forward table bussing and green arrows for the destination-swapped reverse variant. (b) The three synchronized camera views observed by the policy and the world model. (c) Example trajectories scored under the three success criteria. (a) Offline evaluation (open-loop) (b) Online evaluation (closed-loop) [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 4
Figure 4. Figure 4: Correlation between predicted and real-world policy performance. Each point is a (checkpoint, success-criterion) pair. Within each of (a) and (b), the left and right panels show the in-distribution (table bussing) and out-of-distribution (reverse table bussing) splits. policy checkpoints with the π0.5 [25] architecture are evaluated in our experiments. More details are in App. A. Evaluation protocol. For e… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative rollouts for online generation. Each example shows the shared initial ob￾servation, followed by the real-world rollout and the online rollout generated by our evaluator under the same policy. of-distribution split, most sharply in online mode, reflecting the genuine difficulty of generalizing across a task-semantic shift. Offline versus online. One might expect online evaluation to be less fait… view at source ↗
Figure 6
Figure 6. Figure 6: Per-category out￾come reproduction rate under online evaluation. Each bar shows the fraction of trajectories whose evaluator rollout matches the real-world outcome category. A policy evaluator that reproduces aggregate success rates is use￾ful for checkpoint selection, but a stronger criterion is whether the policy evaluator reproduces the same kinds of failures a pol￾icy exhibits in the real world. Two ev… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative effect of inverse dynamics grounding on offline rollouts. Each row shows the same offline (open-loop) rollout under a different training configuration, sampled at five timesteps from a single bussing trajectory. The variant trained without inverse dynamics matches the ground truth for the first second, then drifts onto a different scene and never recovers, while the variant with inverse dynamic… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative effect of cross-view inpainting on wrist-view scene re-entry. Each row shows the same offline (open-loop) rollout under a different training configuration, sampled at four timesteps spanning a re-entry segment in which the wrist camera turns away from the workspace and then returns. Each panel shows a zoomed wrist view on top and the two third-person views below. The variant trained without cro… view at source ↗
Figure 9
Figure 9. Figure 9: Uncertainty-driven early termination on an example rollout. Top, four frames sampled from the imagined rollout. Bottom, the per-chunk consistency error Uchunk over the 14 rollout chunks. The rollout is terminated at the first chunk whose Uchunk exceeds the threshold τ = 0.02 (chunk 10 here), before the visible drift compounds. C.2 Prediction-Execution Horizon Decoupling The prediction-execution horizon dec… view at source ↗
read the original abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that SC3-Eval adapts a pre-trained video foundation model into a policy evaluator by jointly enforcing forward-inverse dynamics consistency (to anchor to a physically plausible action manifold), cross-view consistency (via inpainting for multi-camera coherence), and test-time consistency (inverse-dynamics-based termination to detect drift). Across seven real-world vision-language-action policies it reports a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three video-model baselines, reproducing real-world failure modes, and generalizing to new tasks.

Significance. If the reported correlation and outperformance hold under scrutiny, the work would be a meaningful contribution to scalable robot policy evaluation, offering a concrete alternative to costly real-world rollouts. The empirical numbers, the reproduction of failure modes for diagnostics, and the use of complementary consistency objectives on an existing video model are strengths that could influence how video world models are applied in robotics.

major comments (1)
  1. [Abstract and §4–5] Abstract and §4–5 (Experiments): the central claim that the three consistency mechanisms suffice to keep long-horizon autoregressive rollouts physically plausible and to generalize to policies outside the training distribution is load-bearing for the 0.929 closed-loop Pearson correlation. The abstract asserts that forward-inverse training, cross-view inpainting, and test-time termination counteract drift and compounding errors, yet no quantitative support (trajectory-level physics violation rates, explicit OOD policy splits, or ablation isolating drift on held-out behaviors) is referenced; without such evidence the correlation could reflect in-distribution overlap rather than the claimed evaluator properties.
minor comments (2)
  1. [Abstract] The exact definition and computation of MMRV (reported as 0.119) is not stated in the abstract and should be given explicitly with its formula and relation to the Pearson metric.
  2. [§5] Details on policy selection, baseline hyper-parameter tuning, and whether the seven policies or the held-out real-world data were chosen after inspecting results should be added to §5 to allow assessment of post-hoc bias.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and for acknowledging the potential contribution of SC3-Eval to scalable policy evaluation. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract and §4–5] Abstract and §4–5 (Experiments): the central claim that the three consistency mechanisms suffice to keep long-horizon autoregressive rollouts physically plausible and to generalize to policies outside the training distribution is load-bearing for the 0.929 closed-loop Pearson correlation. The abstract asserts that forward-inverse training, cross-view inpainting, and test-time termination counteract drift and compounding errors, yet no quantitative support (trajectory-level physics violation rates, explicit OOD policy splits, or ablation isolating drift on held-out behaviors) is referenced; without such evidence the correlation could reflect in-distribution overlap rather than the claimed evaluator properties.

    Authors: We agree that the manuscript would be strengthened by explicit quantitative support such as trajectory-level physics violation rates, dedicated OOD policy splits, and an ablation isolating drift on held-out behaviors. The current evidence consists of the closed-loop Pearson correlation of 0.929 measured across seven real-world VLA policies (including generalization to new tasks) together with qualitative reproduction of real-world failure modes. While we believe these results are inconsistent with pure in-distribution overlap, we acknowledge they do not directly quantify per-trajectory physical plausibility or isolate each consistency term's effect on drift. In the revised manuscript we will add an ablation that measures the contribution of each consistency mechanism to rollout stability on held-out behaviors and will report any available quantitative proxies for physical violation derived from the inverse-dynamics consistency signal. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical correlation against held-out real-world data

full rationale

The paper reports closed-loop Pearson correlation (0.929) and MMRV (0.119) computed directly against real-world rollouts of seven held-out vision-language-action policies. The training recipe (forward-inverse dynamics, cross-view inpainting, test-time termination) is described as a method to produce the evaluator, but the final metrics are external benchmarks that do not reduce by construction to any fitted parameter or self-citation chain within the paper. No equations, uniqueness theorems, or ansatzes are shown to force the reported numbers from the training data alone. The evaluation therefore remains self-contained against independent real-world measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond reliance on a pre-trained video foundation model and standard supervised training; the three consistency objectives are presented as training signals rather than new postulated objects.

pith-pipeline@v0.9.1-grok · 5856 in / 1157 out tokens · 22124 ms · 2026-06-26T21:15:37.656223+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 1 canonical work pages

  1. [1]

    X. Yang, R. Dagli, A. Zook, H. Hadfield, A. Goyal, S. Birchfield, F. Ramos, and J. Tremblay. Robolab: A high-fidelity simulation benchmark for analysis of task generalist policies, 2026. URLhttps://arxiv.org/abs/2604.09860

  2. [2]

    A. Jain, M. Zhang, K. Arora, W. Chen, M. Torne, M. Z. Irshad, S. Zakharov, Y . Wang, S. Levine, C. Finn, W.-C. Ma, D. Shah, A. Gupta, and K. Pertsch. Polaris: Scalable real-to- sim evaluations for generalist robot policies, 2025. URLhttps://arxiv.org/abs/2512. 16881

  3. [3]

    F. Zhu, H. Wu, S. Guo, Y . Liu, C. Cheang, and T. Kong. Irasim: A fine-grained world model for robot manipulation. InICCV, 2025. arXiv:2406.14540

  4. [4]

    Quevedo, A

    J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025. doi: 10.48550/arXiv.2506.00613. URLhttps://arxiv.org/abs/2506.00613

  5. [5]

    Tseng, J

    W.-C. Tseng, J. Gu, Q. Zhang, H. Mao, M.-Y . Liu, F. Shkurti, and Y .-C. Lin. Scalable policy evaluation with video world models.arXiv preprint arXiv:2511.11520, 2025. URLhttps: //arxiv.org/abs/2511.11520

  6. [6]

    Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025. URLhttps://arxiv.org/ abs/2510.10125

  7. [7]

    S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025. URLhttps://arxiv.org/abs/2503.00200

  8. [8]

    C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Cou- pling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025. URLhttps://arxiv.org/abs/2504.02792

  9. [9]

    NVIDIA, A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, P. Chattopadhyay, M. Chen, Y . Chen, Y . Chen, S. Cheng, Y . Cui, J. Diamond, Y . Ding, J. Fan, L. Fan, L. Feng, F. Ferroni, S. Fidler, X. Fu, R. Gao, Y . Ge, J. Gu, A. Gupta, S. Gururani, I. El Hanafi, A. Hassani, Z. Hao, J. Huffman, J. Jang, P. Jannaty...

  10. [10]

    Atreya, K

    P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Proceedings of the Conference on Robot Learning (CoRL 2025), 2025

  11. [11]

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kir- mani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941, 2024

  12. [12]

    Evaluating gemini robotics policies in a veo world simulator.arXiv preprint arXiv:2512.10675, 2025

    Gemini Robotics Team. Evaluating gemini robotics policies in a veo world simulator.arXiv preprint arXiv:2512.10675, 2025. URLhttps://arxiv.org/abs/2512.10675

  13. [13]

    Y . Li, Z. Zhou, Y . Chen, Y . Xue, and Y . Zhu. dworldeval: Scalable robotic policy evaluation via discrete diffusion world model.arXiv preprint arXiv:2604.22152, 2026. URLhttps: //arxiv.org/abs/2604.22152. 10

  14. [14]

    Y . Wang, R. Syed, F. Wu, M. Zhang, A. Onol, J. Barreiros, H. Nayyeri, T. Dear, H. Zhang, and Y . Li. Interactive world simulator for robot policy training and evaluation.arXiv preprint arXiv:2603.08546, 2026. URLhttps://arxiv.org/abs/2603.08546

  15. [15]

    M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. In The Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=wPEIStHxYH

  16. [16]

    S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xi- ang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. J. Fan, and J. Jang. World action m...

  17. [17]

    Y . Liu, F. Feng, L. Kong, W. Lu, J. Tang, K. Zhang, K. Murphy, C. Finn, and Y . Du. World action verifier: Self-improving world models via forward-inverse asymmetry.arXiv preprint arXiv:2604.01985, 2026. URLhttps://arxiv.org/abs/2604.01985

  18. [18]

    K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. InAdvances in Neural Information Processing Systems (NeurIPS), 2018. URLhttps://arxiv.org/abs/1805.12114

  19. [19]

    Z. Mei, T. Yin, M. Baker, O. Shorinwa, and A. Majumdar. World models that know when they don’t know: Controllable video generation with calibrated uncertainty.arXiv preprint arXiv:2512.05927, 2025. URLhttps://arxiv.org/abs/2512.05927

  20. [20]

    Kidambi, A

    R. Kidambi, A. Rajeswaran, P. Netrapalli, and T. Joachims. MOReL: Model-based offline reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS),

  21. [21]

    URLhttps://arxiv.org/abs/2005.05951

  22. [22]

    T. Yu, G. Thomas, L. Yu, S. Ermon, J. Zou, S. Levine, C. Finn, and T. Ma. MOPO: Model- based offline policy optimization. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. URLhttps://arxiv.org/abs/2005.13239

  23. [23]

    Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026

    NVIDIA et al. Cosmos 3: Omnimodal world models for physical ai.arXiv preprint arXiv:2606.02800, 2026. URLhttps://arxiv.org/abs/2606.02800

  24. [24]

    Lipman, R

    Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.02747

  25. [25]

    X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023. URLhttps://arxiv.org/abs/2209.03003

  26. [26]

    Black, N

    Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...

  27. [27]

    Z. Xiao, Y . Lan, Y . Zhou, W. Ouyang, S. Yang, Y . Zeng, and X. Pan. Worldmem: Long-term consistent world simulation with memory.arXiv preprint arXiv:2504.12369, 2025. 11 A Dataset and Evaluation Details The following appendices give the dataset, training, and ablation details that complement Sec. 4 of the main paper. Full rollout videos and additional q...