pith. sign in

arxiv: 2606.22363 · v1 · pith:4NK4V37Cnew · submitted 2026-06-21 · 💻 cs.AI · cs.LG· cs.RO

Reference-Free Assessment of Physical Consistency in World Model-based Video Generation

Pith reviewed 2026-06-26 11:10 UTC · model grok-4.3

classification 💻 cs.AI cs.LGcs.RO
keywords physical consistencyvideo generationworld modelsreference-freeDROID-SLAMSEA-RAFTsimulation-to-realityVLA models
0
0 comments X

The pith

Reference-free measures using SLAM and optical flow assess physical consistency in generated videos and improve robotic task success by over 8%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces reference-free measures that combine relative and absolute assessments to evaluate the physical consistency of videos generated by world models. These measures rely on DROID-SLAM and SEA-RAFT to detect inconsistencies without needing ground-truth references or human judgments. By filtering videos according to the relative consistency score, the success rate of tasks performed by vision-language-action models increases by more than 8 percent. This approach helps close the gap between simulated environments and real-world robot performance. The absolute measure additionally allows visualization of where and when physical artifacts occur in the video.

Core claim

The central claim is that reference-free physical consistency assessment via DROID-SLAM and SEA-RAFT enables effective filtering of generated videos, resulting in over 8% improvement in task success rates for VLA models and providing spatio-temporal localization of inconsistencies.

What carries the argument

DROID-SLAM and SEA-RAFT used to quantify physical inconsistencies in generated videos through structure and flow analysis without references.

Load-bearing premise

DROID-SLAM and SEA-RAFT can be applied directly to generated videos to reliably quantify physical inconsistencies without ground-truth references or additional validation specific to synthetic data.

What would settle it

Observing that high-consistency videos according to the measures do not lead to higher task success rates in real-world robotic experiments, or that obvious physical errors like object interpenetration are not flagged by the measures.

Figures

Figures reproduced from arXiv: 2606.22363 by Sukmin Yun, Yun Oh.

Figure 1
Figure 1. Figure 1: Overview of Relative Assessment. Generated rollouts are evaluated using DROID-SLAM [15] and SEA-RAFT [18] to obtain relative anomaly scores, which enable filtering and ranking of samples based on physical plausibility. DROID-SLAM Frames from a single rollout 𝑗𝑡 = 𝑒𝑡 − 𝑒𝑡−1 , 𝑡 ∗ = arg max 𝑡 𝑗𝑡 𝑀𝑖 𝑢, 𝑣 = |𝛿𝐢 𝑢, 𝑣 |2 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Absolute Assessment. Frames from a sin￾gle rollout are evaluated using DROID-SLAM [15] to obtain the anomaly heatmap, which shows where and when physical feasi￾bility collapses. Although obvious artifacts can be easily identified, eval￾uating the overall quality of generated rollouts becomes challenging when multiple imperfections are present. In such cases, it is unclear which samples are more… view at source ↗
Figure 3
Figure 3. Figure 3: Quality Scores in Real and Generated Videos. 3D and photometric consistency exhibit distinctive gaps between real and generated distributions, whereas median of subjective quality remains comparable. Generated videos contain a significantly higher number of outliers in 3D and photometric consistency. and photometric consistency of generated rollouts against real-world videos, we derive an ‘anomaly score’ t… view at source ↗
Figure 5
Figure 5. Figure 5: Success Rates across different conditions after filter￾ing with our relative assessment with the real-world result [5]. Our results show that the low-anomaly group achieves success rates closest to those of real-world videos. We repeat GPT-4o evalua￾tion 10 times per video to reduce evaluator variance, resulting in a smaller SE compared to real-robot experiments. Detailed per-task results are provided in T… view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of Relative Anomaly Scores. Lower score is better. Density allows normalized comparison for varied Open￾VLA rollouts across unequal sizes. proach described in WorldGym [13]. Rollouts per task were categorized into high-anomaly and low-anomaly groups (n=10 each) based on their relative assessment scores, while excluding the middle 10 rollouts to ensure better distinc￾tion between the two groups… view at source ↗
Figure 6
Figure 6. Figure 6: Artifact Heatmap shows the result of the absolute as￾sessment. Red colors imply low fidelity of the specific edges. The first row displays a deformed robot arm and the second row shows a thickened robot arm, both of which are clearly characterized in the adjacent heatmap [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗
Figure 8
Figure 8. Figure 8 [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: LucyEdit original sample’s first frame (left) and the edited footage’s first frame (right). In the edited frame, a distinct ’V’ shape artifact is visible on her chest around the black garment. This contour is unnatural and does not correspond to either cloth￾ing drape or anatomy. ical failure due to the absence of a clean reference frame. The input natural video recorded a mean score of 0.1874 with a max j… view at source ↗
Figure 10
Figure 10. Figure 10: The edited sequence from LucyEdit displays more significant artifacts in the preceding frames compared to this sam￾ple. The specific editing instruction provided to the model was: ‘Change her clothes to a black mini dress for funeral’ [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Noise-corrupted. These are one of samples, which is a custom recorded footage. The left one is the original frame, and the right one is the corrupted image that replaced the original frame for the experiments. Both the heatmap generation and score computation fail due to zero values encountered during the calcu￾lation. With several versions of augmentation applying various prompts and hyperparameter tunin… view at source ↗
read the original abstract

We introduce reference-free measures for evaluating the physical consistency of generated videos, combining relative and absolute approaches to assess fidelity. Although tools like WorldGym or WorldEval enable robotic simulation via video generation, physical fidelity gaps often prevent these environments from accurately reproducing real-world task success rates of VLA models. Unlike existing evaluation methods, which require costly human voting (Elo) or unavailable ground-truth references (FVD), our approach utilizes DROID-SLAM and SEA-RAFT to quantify physical inconsistencies, motivated by WorldScore. Videos filtered using our relative consistency assessment show an improvement in task success rates of over 8%, effectively narrowing the simulation-to-reality gap. Furthermore, our absolute assessment enables spatio-temporal localization, providing visualization of when and where physical artifacts occur.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces reference-free relative and absolute measures for physical consistency in world model-generated videos, leveraging DROID-SLAM and SEA-RAFT (motivated by WorldScore) to quantify inconsistencies without ground-truth references or human voting. It claims that filtering videos via the relative consistency assessment yields an over-8% improvement in downstream task success rates for VLA models, narrowing the sim-to-real gap, while the absolute assessment enables spatio-temporal localization of artifacts.

Significance. If the empirical result holds under proper validation, the work would offer a scalable, automated alternative to costly human evaluation or reference-dependent metrics for improving physical fidelity in video-based world models, with direct relevance to robotic simulation and VLA deployment.

major comments (2)
  1. [Abstract] Abstract and results sections: the headline claim of >8% task-success improvement from relative-consistency filtering is presented without experimental details (number of videos, task suite, baseline filters, statistical tests, or error bars), so the attribution of the lift specifically to physical consistency cannot be verified from the provided evidence.
  2. [Methodology] Methodology (DROID-SLAM/SEA-RAFT application): the central assumption that these estimators, developed and benchmarked on real-camera footage, produce scores that reliably track physical violations when applied to generated videos is unvalidated; generated videos can introduce photometric and geometric artifacts (texture flicker, inconsistent lighting, non-rigid motion) that violate the pipelines' assumptions, risking selection on an unrelated signal.
minor comments (2)
  1. [Methodology] Clarify the exact definitions of the relative and absolute consistency scores, including any thresholds or aggregation steps, to allow reproducibility.
  2. [Discussion] Add a limitations paragraph discussing failure modes of DROID-SLAM and SEA-RAFT on synthetic data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen the presentation and evidence without misrepresenting the current manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results sections: the headline claim of >8% task-success improvement from relative-consistency filtering is presented without experimental details (number of videos, task suite, baseline filters, statistical tests, or error bars), so the attribution of the lift specifically to physical consistency cannot be verified from the provided evidence.

    Authors: We agree the abstract and results presentation lack sufficient experimental details to allow full verification of the claim. The current manuscript reports the >8% improvement but does not embed the supporting parameters in the abstract. In revision we will expand the abstract to include the number of videos evaluated, the VLA task suite, baseline filter comparisons, and references to statistical tests with error bars, while retaining the core claim. This directly addresses the verifiability concern. revision: yes

  2. Referee: [Methodology] Methodology (DROID-SLAM/SEA-RAFT application): the central assumption that these estimators, developed and benchmarked on real-camera footage, produce scores that reliably track physical violations when applied to generated videos is unvalidated; generated videos can introduce photometric and geometric artifacts (texture flicker, inconsistent lighting, non-rigid motion) that violate the pipelines' assumptions, risking selection on an unrelated signal.

    Authors: The manuscript motivates the estimators via WorldScore but does not include explicit validation against generation-specific artifacts. We acknowledge this gap. In the revised version we will add a targeted analysis subsection showing correlation between the consistency scores and human-labeled physical violations on a generated-video subset, plus discussion of robustness to common artifacts. The observed downstream VLA success-rate lift provides supporting (though indirect) evidence that the signal is relevant to physical consistency rather than unrelated factors. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical result independent of inputs

full rationale

The paper applies existing external tools (DROID-SLAM, SEA-RAFT) to generated videos to produce consistency scores, then reports an observed empirical lift (>8% task success) from filtering on those scores. No equations, definitions, or claims in the provided text reduce the metrics or the reported improvement to self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The central claim remains an externally measurable outcome rather than a constructed equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no information on free parameters, axioms, or invented entities is available.

pith-pipeline@v0.9.1-grok · 5652 in / 1118 out tokens · 30283 ms · 2026-06-26T11:10:17.883682+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references

  1. [1]

    Kevin Black, Noah Brown, et al.π 0: A vision-language- action flow model for general robot control, 2026. 1

  2. [2]

    Worldscore: A unified evaluation benchmark for world generation, 2025

    Haoyi Duan, Hong-Xing Yu, et al. Worldscore: A unified evaluation benchmark for world generation, 2025. 1, 2

  3. [3]

    Scaling rectified flow transformers for high-resolution image synthesis, 2024

    Patrick Esser, Sumith Kulal, et al. Scaling rectified flow transformers for high-resolution image synthesis, 2024. 5

  4. [4]

    Robocerebra: A large- scale benchmark for long-horizon robotic manipulation eval- uation, 2025

    Songhao Han, Boxiang Qiu, et al. Robocerebra: A large- scale benchmark for long-horizon robotic manipulation eval- uation, 2025. 1

  5. [5]

    Openvla: An open-source vision-language-action model, 2024

    Moo Jin Kim, Karl Pertsch, et al. Openvla: An open-source vision-language-action model, 2024. 3, 4, 5

  6. [6]

    Fine-tuning vision-language-action models: Optimizing speed and suc- cess, 2025

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and suc- cess, 2025. 1

  7. [7]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026

    Moo Jin Kim, Yihuai Gao, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. 1

  8. [8]

    Worldmodelbench: Judging video generation models as world models, 2025

    Dacheng Li, Yunhao Fang, et al. Worldmodelbench: Judging video generation models as world models, 2025. 1

  9. [9]

    Worldeval: World model as real-world robot policies evaluator, 2025

    Yaxuan Li, Yichen Zhu, et al. Worldeval: World model as real-world robot policies evaluator, 2025. 1

  10. [10]

    Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023

    Bo Liu, Yifeng Zhu, et al. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. 1

  11. [11]

    Sora: Creating video from text.https:// openai.com/index/sora/, 2024

    OpenAI. Sora: Creating video from text.https:// openai.com/index/sora/, 2024. Accessed: 2026- 04-18. 5

  12. [12]

    Gpt-4o system card, 2024

    OpenAI, :, et al. Gpt-4o system card, 2024. 3

  13. [13]

    Worldgym: World model as an environment for policy evaluation, 2025

    Julian Quevedo, Ansh Kumar Sharma, et al. Worldgym: World model as an environment for policy evaluation, 2025. 1, 3, 4

  14. [14]

    Lucy edit: Open-weight text-guided video editing, 2025

    DecartAI Team. Lucy edit: Open-weight text-guided video editing, 2025. 5

  15. [15]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras, 2022

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras, 2022. 1, 2, 3

  16. [16]

    Towards accurate generative models of video: A new metric & chal- lenges, 2019

    Thomas Unterthiner, Sjoerd van Steenkiste, et al. Towards accurate generative models of video: A new metric & chal- lenges, 2019. 1

  17. [17]

    Bridgedata v2: A dataset for robot learning at scale, 2024

    Homer Walke, Kevin Black, et al. Bridgedata v2: A dataset for robot learning at scale, 2024. 3

  18. [18]

    Sea-raft: Simple, efficient, accurate raft for optical flow, 2024

    Yihan Wang, Lahav Lipson, and Jia Deng. Sea-raft: Simple, efficient, accurate raft for optical flow, 2024. 1, 2

  19. [19]

    Grok imagine.https://x.ai/news/grok- imagine-api, 2026

    xAI. Grok imagine.https://x.ai/news/grok- imagine-api, 2026. Accessed: 2026-04-18. 5

  20. [20]

    Vlabench: A large-scale bench- mark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024

    Shiduo Zhang, Zhe Xu, et al. Vlabench: A large-scale bench- mark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024. 1

  21. [21]

    X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language- action model, 2025

    Jinliang Zheng, Jianxiong Li, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language- action model, 2025. 1