
arxiv: 2605.29360 · v1 · pith:WOJCDMMPnew · submitted 2026-05-28 · 💻 cs.AI

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Pith reviewed 2026-06-29 07:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords robotic world models · action-conditioned reliability · MiraBench · physics adherence · action-following fidelity · optimism bias · world model evaluation · robot learning simulators

The pith

MiraBench shows visual fidelity is a poor proxy for whether robotic world models follow actions or avoid false optimism about success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MiraBench as a benchmark that targets action-conditioned reliability rather than visual quality alone for robotic world models used as simulators. It structures evaluation into three levels of increasing demand: checking physical consistency without references, measuring whether predictions match the conditioned actions, and detecting when models incorrectly forecast success under actions that should fail. A human-annotated set of over 16,000 judgments supports tests on 12 model configurations from different conditioning types and scales. The results indicate that better-looking predictions often fail to respect actions, larger models do not consistently improve action adherence, and over-optimism about outcomes is widespread. This matters because unreliable simulators can distort robot learning that relies on them for planning and training.

Core claim

MiraBench defines action-conditioned reliability as the core evaluation target and decomposes it into three levels: Physics Adherence (reference-free physical consistency), Action-Following Fidelity (whether predictions respect task-relevant action inputs), and Optimism Bias Detection (the tendency to predict success under failure-inducing actions). Supported by a human-annotated corpus of more than 16,000 judgments, an evaluation of 12 representative model configurations, covering vector-conditioned and text-conditioned models, open-weight and closed-source systems, and multiple scales, shows that visual fidelity is a poor proxy for action fidelity, that increasing model scale does not reliably improve action following, and that optimism bias is pervasive.

What carries the argument

MiraBench, the hierarchical benchmark that decomposes action-conditioned reliability into the three levels of Physics Adherence, Action-Following Fidelity, and Optimism Bias Detection, backed by human judgments.

If this is right

  • Visual quality metrics alone cannot be trusted when selecting or improving robotic world models for use as simulators.
  • Simply increasing model scale is not a dependable route to better action-conditioned predictions.
  • Current world models systematically overestimate success under actions that should lead to failure.
  • Evaluation must shift from appearance-based proxies to direct checks on action adherence and failure calibration.
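As a rough illustration of what a direct check on failure calibration could look like, the sketch below computes an optimism bias rate: the share of failure-inducing inputs for which a model still predicts success. The `Rollout` record, its field names, and the rate definition are assumptions made for illustration; the paper's own metric may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical record for one generated video (field names are illustrative)."""
    model: str
    action_should_fail: bool  # True when the conditioned action is failure-inducing
    predicted_success: bool   # True when the generated video shows the task succeeding


def optimism_bias_rate(rollouts):
    """Fraction of failure-inducing rollouts in which the model still predicts success."""
    failing = [r for r in rollouts if r.action_should_fail]
    if not failing:
        return None  # undefined when there are no failure-inducing inputs
    return sum(r.predicted_success for r in failing) / len(failing)


# Made-up example data, not results from the paper.
rollouts = [
    Rollout("model_a", True, True),
    Rollout("model_a", True, True),
    Rollout("model_a", True, False),
    Rollout("model_a", False, True),  # nominal action: ignored by the rate
]
rate = optimism_bias_rate(rollouts)
print(f"optimism bias rate: {rate:.2f}")
```

Nominal (non-failing) rollouts are excluded from the denominator so that the rate isolates behaviour under actions that should fail.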

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robot learning pipelines that rely on these models for data generation may need explicit filters for action fidelity before using generated trajectories.
  • The benchmark could be applied during model development to guide training objectives toward better action following rather than visual realism.
  • Extending the evaluation to real-robot rollouts would test whether the detected biases transfer to physical outcomes.
  • Designers of future simulators might incorporate explicit failure-mode training to reduce the observed optimism bias.

Load-bearing premise

The human-annotated corpus of over 16,000 judgments provides an unbiased and sufficiently reliable ground truth for action-conditioned reliability across the tested failure categories and tasks.

What would settle it

The central findings would be undermined if independent re-annotation of the same prediction samples by a new set of annotators, or evaluation on a fresh set of models, showed that visual fidelity strongly correlates with action fidelity or that optimism bias is rare.
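One way such a check could be run is a rank correlation between per-model visual-fidelity and action-fidelity scores. The sketch below implements Spearman's coefficient in plain Python, using average ranks for ties. The score lists are placeholder values invented for illustration, and the choice of Spearman correlation is an assumption rather than the paper's stated analysis.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5


# Placeholder per-model scores (hypothetical, not taken from the paper).
visual_fidelity = [0.91, 0.85, 0.78, 0.70]
action_fidelity = [0.40, 0.62, 0.35, 0.58]
rho = spearman(visual_fidelity, action_fidelity)
print(f"Spearman rho: {rho:.2f}")
```

A coefficient near +1 on re-annotated data would count against the "poor proxy" finding; values near zero or below would be consistent with it.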

Figures

Figures reproduced from arXiv: 2605.29360 by Boyuan Chen, Jiaming Ji, Jiawei Chen, Jiayi Zhou, Juntao Dai, Tianzhuo Yang, Yaodong Yang, Zhaoyi Zhang, Zihan Shen, Zirui Mi.

Figure 1: Representative failure modes motivating MiraBench. (a–b) Physics Adherence failures include object morphing, disappearance, and implausible free-fall dynamics. (c) Action-Following failures occur when predicted motion is incomplete or mismatched with the commanded action. (d) Optimism Bias occurs when failure actions are overwritten by successful predictions.
Figure 2: Overview of MiraBench’s human annotation corpus. The corpus contains 906 generated videos and 16,704 structured human annotation decisions across three modules: (a) Physical Adherence, which includes Physical Consistency and Physics Law Compliance, (b) Action Following, and (c) Optimism Bias Detection. These annotations are collected on representative model outputs and provide per-indicator supervision fo…
Figure 3: Overview of MiraBench. From robotic manipulation episodes, MiraBench constructs nominal and failure-inducing action inputs, evaluates generated rollouts through three levels of action-conditioned reliability, and uses VLM evaluators from structured annotation data. Rule-based kinematic checks complement the VLM evaluators for physics-law compliance.
Figure 4: (a) Action-following does not guarantee failure preservation. (b) GR1 post-training improves task execution while weakening failure preservation at fixed scale. (c) Scaling and additional fine-tuning produce non-uniform changes across all three benchmark levels. (d) Open-weight [O] and closed-source [C] systems show complementary strengths, but no model dominates all levels.
Figure 5: Data collection and processing pipelines. Left (blue): Arena-GR1 starts from 10 teleoperated demonstrations in Isaac Lab, expands to 50 via Mimic, and is stored as JSONL metadata and MP4 videos in GR00T-LeRobot format. Right (green): SynData captures real-world bimanual demonstrations with a full-modality exoskeleton rig, temporally aligns all streams to 10 FPS, and stores trajectory data in Zarr v3 with …
Figure 6: Optimism bias evaluator performance. Left: Per-model accuracy, Y recall, and N recall. Center: Overall confusion matrix (n = 294, accuracy 87.8%). Right: Distribution of “Same” vote counts across all 376 samples; the bimodal pattern confirms that most samples produce clear majority decisions, supporting the 7-frame voting design.
Figure 7: Composition of the released human-annotation set. (a) Number of annotated videos per evaluation level. (b) Per-level share of the 16,704 individual judgements: Physical Consistency contributes 16 indicators per video, Action Following 5, Optimism Bias 18, while Physics Law uses a single overall grade (with an additional 10 anomaly tracks reported in Appendix H).
Figure 8: Per-model severe-violation rate on the 16 physical-consistency indicators, occ68 head-to-head videos (n=30 per cell). Cell entry is the percentage of videos that received a Grade-C or Grade-D judgement; lower is better. Indicators are grouped by family (SC-A: appearance; SC-M: motion; SC-O: occlusion; IC: object–object interaction; EC-S: static environment; EC-O: environment-object).
Figure 9: Full A/B/C/D grade distribution per indicator, all four head-to-head models. Each panel summarises one indicator; bars within a panel correspond to the four models in the legend. The colour code follows the rubric of Appendix G: green (A) = no perceived violation, light-green (B) = mild, yellow (C) = evident, red (D) = severe, grey = N/A. The proportion of red cells on the right (occlusion / penetration) i…
Figure 10: Physics-law overall-grade distribution on the free-fall set (n=89 DreamDojo-14B videos across five object categories). No video receives Grade A; the dominant grade is C/D, with Grade D alone accounting for 31.5% of clips. Even when the prompt distribution spans five object categories, the model’s free-fall predictions rarely satisfy the rubric’s plausibility bar.
Figure 11: Action Following on DreamDojo-14B (n=210). (a) Annotator-judged Task Completion Rate overall and by subset; flat is single-step pick-and-place (n=100), gr1_episode is the long-horizon humanoid setting (n=110). (b) Stacked grade distribution of the five visual-quality dimensions (PP physical plausibility, MQ motion quality, TC text-condition consistency, VS visual stability, OS overall sense). Motion qual…
Figure 12: Three faces of optimism bias on the labelled Optimism-Bias subset. (a) MA-9: does the rollout exhibit optimism bias on the unperturbed clip? Y and Y? both count as bias. (b) MB-9: does the rollout still predict success when the prompt is perturbed to make the task fail? (c) MB-5: does the rollout overestimate baseline success? Higher bars on the red end indicate worse behaviour. The visual-quality panel o…
Figure 13: DreamDojo-14B, occ68 episode 0008. The dark cloth held by the right gripper is briefly fully occluded between the two arms (frame 1→2). When it re-emerges in the prediction, the original cloth has been replaced by a stainless-steel kettle, a yellow plate and a red tomato-like object, none of which appeared in the ground-truth scene.
Figure 14: DreamDojo-14B, occ68 episode 0015. The right manipulator briefly traverses the upper-right corner of the green dumpling tray, partially occluding it (frame 1→2). After the arm withdraws, the prediction renders the tray with a broken right edge: the section that was momentarily covered fails to reconstruct.
Figure 15: DreamDojo-14B, occ68 episode 0063. The multi-coloured patterned wrapper held by the gripper progressively desaturates in the prediction; its bright printed pattern flattens into a uniform pale-green tone over the four sampled frames.
Figure 16: DreamDojo-14B, occ68 episode 0064. A blue plastic-coated wrapper that retains crisp creases and specular highlights in the ground truth drapes like a piece of soft cloth in the prediction. The plastic-like sheen is lost and the wrapper’s area contracts noticeably between mid- and end-clip.
Figure 17: DreamDojo-14B, occ68 episode 0045. The patterned snack package being held in the ground truth disappears in the prediction; the workspace is then populated with a cluster of indistinct, ill-defined blobs whose category and geometry cannot be determined from the rendered pixels. In the last two frames these blobs adhere to the right gripper and forearm, fusing visually with the manipulator itself.
Figure 18: DreamDojo-2B, episode_0172. The predicted video accurately follows the instruction: the robot grasps the mango and places it onto the lower shelf of the rack. Objects remain visually coherent throughout, with no deformation or artefacts.
Figure 19: Happy Horse, episode_0127. The robot appears to complete the task of placing toast into the toaster, but the bread slice undergoes severe visual distortion during manipulation: multiple slices merge into a single amorphous mass.
Figure 20: Cosmos-14B, episode_0021. The robot fails to place the pear onto the tray, the arm reaches in the wrong direction, and the scene exhibits bizarre chromatic flashing with object deformation throughout.
Figure 21: Wan2.2, episode_0082. The ground-truth instruction specifies that the robot fails to lift the milk carton, yet the predicted video shows the robot successfully grasping and lifting it. Objects remain perfectly preserved; the model generates a physically plausible but optimistically biased outcome.
Figure 22: Per-perturbation optimism bias rate broken down by model. grip_force_weak and contact_oscillation trigger the highest bias in DreamDojo-2B (100%), while wrist_tilt_grasp is the least biased perturbation overall (33.3%). Note that model sensitivity varies strongly by perturbation type: DreamDojo-2B shows 0% bias on grip_carry_slip and wrist_tilt_grasp, while Happy Horse shows 100% on those same types.
Figure 23: Happy Horse, task_023. The baseline prediction is visually excellent (MA-1 = 4/5) with fully plausible physics, yet the perturbed video is indistinguishable from baseline despite a failure-inducing perturbation. The model generates physically coherent, task-successful outcomes regardless of the action input. Source: human_annotation_happyhorse_i2v/videos/task_023/
Figure 24: DreamDojo-2B, task_024. Despite lower visual quality than larger models, this 2B model achieves high baseline-GT alignment (MA-1 = 4/5) on this episode while completely ignoring the perturbation. The perturbed prediction shows task completion identical to baseline.
Figure 25: DreamDojo-2B, task_029. The baseline itself exhibits poor physical plausibility (penetration, impossible deformation), yet the model still generates task-successful predictions under perturbation. Bias persists even when the model’s physics understanding is clearly deficient. Source: human_annotation_dreamdojo_2b_gr1/videos/task_029/
Figure 26: Happy Horse, task_022. The model correctly responds to the perturbation: the perturbed prediction shows clear task failure (MB-6 = “not completed”) with significant trajectory divergence from baseline. This demonstrates that the same model architecture can propagate failure signals in some cases. Source: human_annotation_happyhorse_i2v/videos/task_022/
Figure 27: Happy Horse, task_011. The model shows a detectable but insufficient response to the perturbation: the perturbed video achieves only partial completion with large final-state deviation from baseline, yet annotators still detect mild optimism bias (MA-9 = Y?) and false success (MB-9 = Y), indicating the model dampens but does not fully propagate the failure signal.
Figure 28: Wan2.1, task_017. The model shows no full optimism bias, but not because it correctly follows actions: the baseline itself completely fails (MA-1 = 1/5, MB-1 = “not completed”), so there is no “success” for the model to hallucinate under perturbation. Low bias here reflects incapability, not fidelity. Source: human_annotation_wan21_i2v_14b/videos/task_017/
Original abstract

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MiraBench, a hierarchical benchmark for action-conditioned reliability in robotic world models. It decomposes the target into three levels—Physics Adherence (reference-free physical consistency), Action-Following Fidelity (respect for task-relevant actions), and Optimism Bias Detection (prediction of success under failure-inducing actions)—supported by a human-annotated corpus of over 16,000 judgments. The work evaluates 12 model configurations spanning vector- and text-conditioned systems, open- and closed-source models, and multiple scales, reporting three findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive.

Significance. If the human judgments are shown to be reliable, MiraBench would provide a valuable diagnostic shift from visual-only evaluation to action-conditioned reliability, highlighting concrete limitations in current world models for use as robot-learning simulators and offering a foundation for targeted improvements.

major comments (3)
  1. [Abstract / corpus curation paragraph] Abstract and paragraph on corpus curation: the central claims rest on the 16,000+ human judgments serving as ground truth for Physics Adherence, Action-Following Fidelity, and Optimism Bias Detection, yet no annotation protocol, inter-annotator agreement statistics, statistical significance tests, or bias controls are reported. This directly undermines the reliability of all three findings.
  2. [Evaluation section] Evaluation of 12 model configurations: the manuscript does not detail how tasks and failure categories were selected or balanced, nor any calibration of human labels against objective physics simulators, leaving open the possibility that the reported dissociation between visual and action fidelity, the scale-insensitivity result, and the pervasiveness of optimism bias are artifacts of annotation inconsistency or selection bias.
  3. [Results] Results reporting the three central findings: without quantitative validation metrics for the human corpus (e.g., agreement rates or simulator cross-checks), the claims that visual fidelity is a poor proxy and that optimism bias is pervasive cannot be assessed for robustness.
minor comments (1)
  1. [Abstract] The abstract states the evaluation was performed on 12 models with 16k judgments but provides no table or appendix summarizing the exact model configurations, task distributions, or failure-category breakdowns.
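The inter-annotator agreement statistic requested in the major comments could be reported, for example, as Cohen's kappa between pairs of annotators. The sketch below is a minimal plain-Python version for two annotators and categorical labels. The Y/N labels and the example data are illustrative assumptions, and nothing here reflects the authors' actual annotation protocol.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgments on four videos ("Y" = optimism bias present).
annotator_1 = ["Y", "Y", "N", "N"]
annotator_2 = ["Y", "N", "N", "N"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

With more than two annotators per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the more natural choice.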

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and for emphasizing the need to substantiate the human annotation corpus. We agree that additional methodological details are required and will revise the manuscript to address each point raised.

Point-by-point responses
  1. Referee: [Abstract / corpus curation paragraph] Abstract and paragraph on corpus curation: the central claims rest on the 16,000+ human judgments serving as ground truth for Physics Adherence, Action-Following Fidelity, and Optimism Bias Detection, yet no annotation protocol, inter-annotator agreement statistics, statistical significance tests, or bias controls are reported. This directly undermines the reliability of all three findings.

    Authors: We agree that the manuscript does not report these details. In revision we will add a dedicated subsection on corpus curation that specifies the full annotation protocol, reports inter-annotator agreement statistics, includes statistical significance tests on the judgments, and describes bias-control procedures employed during collection of the 16,000 judgments. revision: yes

  2. Referee: [Evaluation section] Evaluation of 12 model configurations: the manuscript does not detail how tasks and failure categories were selected or balanced, nor any calibration of human labels against objective physics simulators, leaving open the possibility that the reported dissociation between visual and action fidelity, the scale-insensitivity result, and the pervasiveness of optimism bias are artifacts of annotation inconsistency or selection bias.

    Authors: We will expand the Evaluation section with explicit criteria and balancing statistics for task and failure-category selection. On simulator calibration, the benchmark is intentionally reference-free; we will add any available cross-checks against physics simulators and, where none exist, state this limitation explicitly so readers can assess potential selection effects. revision: partial

  3. Referee: [Results] Results reporting the three central findings: without quantitative validation metrics for the human corpus (e.g., agreement rates or simulator cross-checks), the claims that visual fidelity is a poor proxy and that optimism bias is pervasive cannot be assessed for robustness.

    Authors: We will incorporate the quantitative validation metrics (agreement rates, significance tests, and any simulator cross-checks) into the Results and Methods sections so that the robustness of the three findings can be directly evaluated. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential claims

full rationale

The paper introduces MiraBench as a new hierarchical benchmark and curates a fresh human-annotated corpus of >16k judgments to evaluate 12 model configurations. All three central findings (visual fidelity poor proxy for action fidelity; scale does not reliably improve action following; optimism bias pervasive) are direct observational results from applying the new corpus to the models. No equations, fitted parameters, predictions, or uniqueness theorems are present. No self-citations are invoked to justify load-bearing steps. The work is self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, fitted parameters, or new postulated entities; the central claims rest on the validity of the human judgments and the representativeness of the 12 model configurations.

pith-pipeline@v0.9.1-grok · 5828 in / 1242 out tokens · 23068 ms · 2026-06-29T07:44:43.159403+00:00 · methodology


Reference graph

Works this paper leans on

124 extracted references · 44 canonical work pages · 21 internal anchors

  1. [1]

    Happyhorse

    Alibaba Inc. Happyhorse. https://www.happyhorse.cn/, 2026. Accessed: 2026-05-06

  2. [2]

    Diffusion for world modeling: Visual details matter in atari, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari, 2024. URL https://arxiv.org/abs/2405.12399

  3. [3]

    Physion: Evaluating physical prediction from vision in humans and machines

    Daniel M. Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Fish Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin Smith, Fan-Yun Sun, Li Fei-Fei, Nancy Kanwisher, Joshua B. Tenenbaum, Daniel L. K. Yamins, and Judith E. Fan. Physion: Evaluating physical prediction from vision in humans and machines, 2022. URL https://arxiv.org/abs/2106.08261

  4. [4]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023

    Anthony Brohan, Noah Brown, Justice Carbajal, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023. URL https://arxiv.org/abs/2307.15818

  5. [5]

    Genie: Generative interactive environments. arXiv preprint arXiv:2402.15391, 2024

    Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder ...

  6. [6]

    Lerobot: An open-source library for end-to-end robot learning

    Remi Cadene, Simon Alibert, Francesco Capuano, Michel Aractingi, Adil Zouitine, Pepijn Kooijmans, Jade Choghari, Martino Russi, Caroline Pascal, Steven Palma, Mustafa Shukor, Jess Moss, Alexander Soare, Dana Aubakirova, Quentin Lhoest, Quentin Gallouédec, and Thomas Wolf. Lerobot: An open-source library for end-to-end robot learning. In The Fourteenth ...

  7. [7]

    Comphy: Compositional physical reasoning of objects and events from videos

    Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, and Chuang Gan. Comphy: Compositional physical reasoning of objects and events from videos, 2022. URL https://arxiv.org/abs/2205.01089

  8. [8]

    Open x-embodiment: Robotic learning datasets and rt-x models, 2025

    Embodiment Collaboration, Abby O’Neill, Abdul Rehman, et al. Open x-embodiment: Robotic learning datasets and rt-x models, 2025. URL https://arxiv.org/abs/2310.08864

  9. [9]

    Worldscore: A unified evaluation benchmark for world generation, 2025

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation, 2025. URL https://arxiv.org/abs/2504.00983

  10. [10]

    Vista: A generalizable driving world model with high fidelity and versatile controllability, 2024

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability, 2024. URL https://arxiv.org/abs/2405.17398

  11. [11]

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter ...

  12. [12]

    "PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

    Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, and Xin Eric Wang. "phyworldbench": A comprehensive evaluation of physical realism in text-to-video models, 2026. URL https://arxiv.org/abs/2507.13428

  13. [13]

    Ctrl-world: A controllable generative world model for robot manipulation, 2026

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation, 2026. URL https://arxiv.org/abs/2510.10125

  14. [14]

    World models

    David Ha and Jürgen Schmidhuber. World models. 2018. doi: 10.5281/ZENODO.1207631. URL https://zenodo.org/record/1207631

  15. [15]

    Dream to control: Learning behaviors by latent imagination, 2020

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination, 2020. URL https://arxiv.org/abs/1912.01603

  16. [16]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models, 2022. URL https://arxiv.org/abs/2010.02193

  17. [17]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2024. URL https://arxiv.org/abs/2301.04104

  18. [18]

    Temporal difference learning for model predictive control, 2022

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control, 2022. URL https://arxiv.org/abs/2203.04955

  19. [19]

    Gaia-1: A generative world model for autonomous driving,

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving,

  20. [20]

    URL https://arxiv.org/abs/2309.17080

  21. [21]

    Vbench: Comprehensive benchmark suite for video generative models, 2023

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. URL https://arxiv.org/abs/2311.17982

  22. [22]

    When to trust your model: Model-based policy optimization, 2021

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization, 2021. URL https://arxiv.org/abs/1906.08253

  23. [23]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...

  24. [24]

    Worldmodelbench: Judging video generation models as world models, 2025

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, and Yao Lu. Worldmodelbench: Judging video generation models as world models, 2025. URL https://arxiv.org/abs/2502.20694

  25. [25]

    Evalcrafter: Benchmarking and evaluating large video generation models, 2024

    Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models, 2024. URL https://arxiv.org/abs/2310.11440

  26. [26]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation, 2021. URL https://arxiv.org/abs/2108.03298

  27. [27]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models, 2023. URL https://arxiv.org/abs/2209.00588

  28. [28]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, ...

  29. [29]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA, Niket Agarwal, Arslan Ali, et al. Cosmos world foundation model platform for physical ai, 2025. URL https://arxiv.org/abs/2501.03575

  30. [30]

    Learning dexterous in-hand manipulation, 2019

    OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation, 2019. URL https://arxiv.org/abs/1808.00177

  31. [31]

    Worldsimbench: Towards video generation models as world simulators, 2024

    Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, and Ruimao Zhang. Worldsimbench: Towards video generation models as world simulators, 2024. URL https://arxiv.org/abs/2410.18072

  32. [32]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv...

  33. [33]

    Intphys: A framework and benchmark for visual intuitive physics reasoning, 2020

    Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys: A framework and benchmark for visual intuitive physics reasoning, 2020. URL https://arxiv.org/abs/1803.07616

  34. [34]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, December 2020. ISSN 1476-4687. doi: 10.1038/s41586-020-0...

  35. [35]

    Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models, 2026

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, and Yong Li. Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models,...

  36. [36]

    T2v-compbench: A comprehensive benchmark for compositional text-to-video generation, 2025

    Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2v-compbench: A comprehensive benchmark for compositional text-to-video generation, 2025. URL https://arxiv.org/abs/2407.14505

  37. [37]

    Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull., 2(4):160–163, July 1991. ISSN 0163-5719. doi: 10.1145/122344.122377. URL https://doi.org/10.1145/122344.122377

  38. [38]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

  39. [39]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world,

  40. [40]

    URL https://arxiv.org/abs/1703.06907

  41. [41]

    Physion++: Evaluating physical scene understanding that requires online inference of different physical properties, 2023

    Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Joshua B. Tenenbaum, Daniel LK Yamins, Judith E Fan, and Kevin A. Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties, 2023. URL https://arxiv.org/abs/2306.15668

  42. [42]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. URL https://arxiv.org/abs/1812.01717

  43. [43]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale, 2024. URL https://arxiv.org/abs/2308.12952

  44. [44]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang...

  45. [45]

    Gensim: Generating robotic simulation tasks via large language models, 2024

    Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang. Gensim: Generating robotic simulation tasks via large language models, 2024. URL https://arxiv.org/abs/2310.01361

  46. [46]

    Drivedreamer: Towards real-world-driven world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-driven world models for autonomous driving, 2023. URL https://arxiv.org/abs/2309.09777

  47. [47]

    Robogen: Towards unleashing infinite data for automated robot learning via generative simulation

    Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation, 2024. URL https://arxiv.org/abs/2311.01455

  48. [48]

    Learning Interactive Real-World Simulators

    Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators, 2024. URL https://arxiv.org/abs/2310.06114

  49. [49]

    CLEVRER: CoLlision Events for Video REpresentation and Reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. Clevrer: Collision events for video representation and reasoning, 2020. URL https://arxiv.org/abs/1910.01442

  50. [50]

    Scaling Robot Learning with Semantically Imagined Experience

    Tianhe Yu, Ted Xiao, Austin Stone, Jonathan Tompson, Anthony Brohan, Su Wang, Jaspiar Singh, Clayton Tan, Dee M, Jodilyn Peralta, Brian Ichter, Karol Hausman, and Fei Xia. Scaling robot learning with semantically imagined experience, 2023. URL https://arxiv.org/abs/2302.11550

  51. [51]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness, 2025. URL https://arxiv.org/abs/2503.21755

  52. [52]

    RoboDreamer: Learning Compositional World Models for Robot Imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination, 2024. URL https://arxiv.org/abs/2404.12377

  53. [53]

    IRASim: Learning interactive real-robot action simulators

    Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: A fine-grained world model for robot manipulation, 2025. URL https://arxiv.org/abs/2406.14540

  54. [54]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingchen...

  55. [55]

    Left hand joint commands scaled to (1 − s) from t0 = 0.40T onward

    Grip Force Insufficient (grip_force_weak). Left hand joint commands scaled to $(1 - s)$ from $t_0 = 0.40T$ onward. At $s = 0.5$, grip force is halved; the object should slip during transport.

    $$a'_{t,\text{L-hand}} = (1 - s) \cdot a_{t,\text{L-hand}}, \qquad t \ge \lfloor 0.40T \rfloor \tag{5}$$
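A minimal sketch of Eq. (5), assuming the action sequence is a list of per-frame joint-command lists and that the left-hand joints occupy known column indices. The function name, the `left_hand_idx` argument, and the toy data layout are hypothetical and not taken from the paper's code.

```python
import math


def grip_force_weak(actions, left_hand_idx, s):
    """Scale left-hand joint commands by (1 - s) for all t >= floor(0.40 * T).

    actions: list of T frames, each a list of joint commands (assumed layout).
    left_hand_idx: column indices of the left-hand joints (hypothetical).
    s: perturbation severity in [0, 1].
    """
    T = len(actions)
    t0 = math.floor(0.40 * T)
    perturbed = [list(frame) for frame in actions]  # copy; input is left intact
    for t in range(t0, T):
        for j in left_hand_idx:
            perturbed[t][j] = (1.0 - s) * actions[t][j]
    return perturbed


# Toy episode: 10 frames, 3 joints, with column 2 standing in for the left hand.
toy_actions = [[1.0, 1.0, 1.0] for _ in range(10)]
weak = grip_force_weak(toy_actions, left_hand_idx=[2], s=0.5)
```

With `s = 0.5` the hand channel is halved from frame 4 onward while the other joints keep their original values.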

  56. [56]

    Left hand joints reduced to 2% during carry phase (0.40T to 0.80T ), before reaching placement target

    Premature Release (premature_release). Left hand joints reduced to 2% during carry phase ($0.40T$ to $0.80T$), before reaching placement target. Object should fall mid-transport.

    $$a'_{t,\text{L-hand}} = 0.02 \cdot a_{t,\text{L-hand}}, \qquad \lfloor 0.40T \rfloor \le t \le \lfloor 0.80T \rfloor \tag{6}$$

  57. [57]

    Left hand timing advanced by ∆ = ⌊T (0.15 + 0.20s)⌋ frames; arm trajectory unchanged

    Grip Carry Slip (grip_carry_slip). Left hand timing advanced by $\Delta = \lfloor T(0.15 + 0.20s) \rfloor$ frames; arm trajectory unchanged. Gripper opens before arm reaches target.

    $$a'_{t,\text{L-hand}} = a_{\min(t+\Delta,\, T-1),\text{L-hand}} \tag{7}$$
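The time shift in Eq. (7) can be sketched as follows, again assuming a list-of-lists action layout with hypothetical column indices for the left-hand joints. The clamp to the last frame is what the `min(t + Δ, T − 1)` subscript expresses; everything else here (names, toy data) is illustrative.

```python
import math


def grip_carry_slip(actions, left_hand_idx, s):
    """Advance left-hand commands by delta frames; arm joints are untouched."""
    T = len(actions)
    delta = math.floor(T * (0.15 + 0.20 * s))
    perturbed = [list(frame) for frame in actions]
    for t in range(T):
        src = min(t + delta, T - 1)  # clamp to the final frame
        for j in left_hand_idx:
            perturbed[t][j] = actions[src][j]
    return perturbed


# Toy episode: 20 frames; column 0 is an arm joint, column 1 the left hand.
# Each value encodes its own frame index so the shift is easy to read off.
toy_actions = [[100.0 + t, float(t)] for t in range(20)]
slipped = grip_carry_slip(toy_actions, left_hand_idx=[1], s=0.5)
```

For this toy episode the shift is 5 frames, so the hand command at frame 0 is the one originally issued at frame 5, and the last frames repeat the final hand command.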

  58. [58]

    3-cycle sinusoidal injection on both left and right arm joints during contact phase (0.25T to 0.70T), amplitude A = 0.4 · std(a:, L-arm)

    Contact Oscillation (contact_oscillation). 3-cycle sinusoidal injection on both left and right arm joints during contact phase ($0.25T$ to $0.70T$), amplitude $A = 0.4 \cdot \operatorname{std}(a_{:,\text{L-arm}})$. Prevents stable grasp formation.

    $$a'_{t,\text{L/R-arm}} = a_{t,\text{L/R-arm}} + A \cdot \sin\!\left(\frac{6\pi (t - t_0)}{t_1 - t_0}\right) \tag{8}$$
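A hedged sketch of Eq. (8). Several details are assumptions rather than statements from the source: that $t_0$ and $t_1$ are the floored start and end of the contact phase, that both endpoints are included, and that the standard deviation is the population standard deviation pooled over all left-arm entries. Function and argument names are hypothetical.

```python
import math
import statistics


def contact_oscillation(actions, left_arm_idx, right_arm_idx):
    """Add a 3-cycle sinusoid to both arms during the assumed contact window."""
    T = len(actions)
    t0, t1 = math.floor(0.25 * T), math.floor(0.70 * T)  # assumed window bounds
    left_vals = [actions[t][j] for t in range(T) for j in left_arm_idx]
    amp = 0.4 * statistics.pstdev(left_vals)  # assumed: pooled population std
    perturbed = [list(frame) for frame in actions]
    for t in range(t0, t1 + 1):
        # 6*pi over the window gives three full sine cycles.
        offset = amp * math.sin(6.0 * math.pi * (t - t0) / (t1 - t0))
        for j in list(left_arm_idx) + list(right_arm_idx):
            perturbed[t][j] = actions[t][j] + offset
    return perturbed


# Toy episode: 20 frames; columns are [left arm, right arm, hand].
toy_actions = [[0.1 * t, 1.0, 0.5] for t in range(20)]
shaken = contact_oscillation(toy_actions, left_arm_idx=[0], right_arm_idx=[1])
```

Note that the amplitude is derived from the left arm only, yet the same offset is applied to both arms, which matches the wording of the definition.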

  59. [59]

    Both left and right wrist joints (2 per wrist) offset by +0.8 rad from 0.15T to 0.85T, causing incorrect contact geometry

    Wrist Tilt During Grasp (wrist_tilt_grasp). Both left and right wrist joints (2 per wrist) offset by +0.8 rad from $0.15T$ to $0.85T$, causing incorrect contact geometry.

    $$a'_{t,\text{L/R-wrist}} = a_{t,\text{L/R-wrist}} + 0.8 \tag{9}$$
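Eq. (9) is a constant offset applied inside a time window. The sketch below assumes floored, inclusive window bounds and a list-of-lists action layout; the function name and `wrist_idx` argument are hypothetical.

```python
import math


def wrist_tilt_grasp(actions, wrist_idx, offset_rad=0.8):
    """Add a fixed offset (radians) to wrist joints between 0.15T and 0.85T."""
    T = len(actions)
    t_start, t_end = math.floor(0.15 * T), math.floor(0.85 * T)  # assumed bounds
    perturbed = [list(frame) for frame in actions]
    for t in range(t_start, t_end + 1):
        for j in wrist_idx:
            perturbed[t][j] = actions[t][j] + offset_rad
    return perturbed


# Toy episode: 20 frames; columns 0-1 stand in for wrist joints, column 2 for the arm.
toy_actions = [[0.0, 0.0, 1.0] for _ in range(20)]
tilted = wrist_tilt_grasp(toy_actions, wrist_idx=[0, 1])
```

The same windowed pattern, with a multiplication in place of the addition, covers the scaled perturbations in Eqs. (6) and (10).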

  60. [60]

    From a first-person perspective, picks up a red apple from the center of a wooden table and carefully places it into the bottom shelf of a two-tiered wooden crate on the right

    Approach Overshoot (approach_overshoot). Left arm joint trajectory scaled ×1.30 during approach ($0.10T$ to $0.75T$); end-effector overshoots object before gripper closes.

    $$a'_{t,\text{L-arm}} = 1.30 \cdot a_{t,\text{L-arm}}, \qquad \lfloor 0.10T \rfloor \le t \le \lfloor 0.75T \rfloor \tag{10}$$

    B.2 Per-Task Perturbation Assignment

    Each task receives 3 perturbations (plus baseline), with 2 mandatory types applied univer...

  61. [61]

    0.1 newton of grip force

    Failure mode explicit: the prompt explicitly states the physical failure (e.g., “0.1 newton of grip force”, “releases prematurely”) so that any text-conditioned model with adequate language understanding should generate the corresponding failure outcome

  62. [62]

    Task context preserved: the prompt retains full scene description (objects, spatial layout, robot morphology) so the model is not confused about what task is being attempted

  63. [63]

    no penetration

    Human-verified: all prompts are reviewed by annotators who confirm (a) the described failure matches the vector-level perturbation effect, and (b) the scene description matches the first frame the model will receive as conditioning.

    B.4 Summary Table

    Design principles. The perturbations satisfy three criteria: (1) Physical interpretability: each correspond...

  64. [64]

    0.1 newton

    The described failure mode is physically consistent with the vector-level perturbation (e.g., “0.1 newton” correctly reflects the ×0.5 grip force reduction)

  65. [65]

    The scene description (objects, colors, spatial layout) matches the first frame the model will receive

  66. [66]

    occlusion-event present?

    The prompt does not inadvertently reveal evaluation criteria or contain ambiguous language.

    Cross-task consistency. Prompts across all 9 tasks follow a uniform structure (viewpoint + agent + failure description + object + intended action), ensuring that differences in model performance across tasks reflect task difficulty rather than prompt quality variatio...

  67. [67]

    recognising that an object was successfully placed mid-video even if the arm subsequently moves away

    Whole-sequence over per-frame: Presenting all 16 frames at once allows the VLM to reason about temporal progression, e.g., recognising that an object was successfully placed mid-video even if the arm subsequently moves away. Per-frame voting schemes risk losing this context and introduce sensitivity to aggregation hyper-parameters (tail weighting, vote threshold)

  68. [68]

    Removing GT frames eliminates this bias and evaluates the predicted video on its own merits

    No ground-truth reference: Showing GT frames alongside predicted frames creates an implicit arm-pose anchor that biases the judge against models whose trajectories differ from GT, even when the task goal is achieved. Removing GT frames eliminates this bias and evaluates the predicted video on its own merits

  69. [69]

    {instruction}

    Single binary output: A straightforward 0/1 judgment avoids the need for confidence calibration, vote aggregation, or threshold tuning, making the metric simple and reproducible.

    Human consistency validation (GR1 split, 48 episodes). We validate the automated evaluator against human-verified labels on 48 GR1 episodes. The evaluator achieves an accuracy of 8...

  70. [70]

    Is the target object (the object being manipulated) clearly visible in Frame1, without unexpected blurring, occlusion, or disappearance?

  71. [71]

    Are all objects in Frame1 free of distortion or unnatural deformation?

  72. [72]

    No explanation

    Are there no objects that pop in or pop out unnaturally between frames (appearing or vanishing without physical cause)?

    Respond ONLY with 0 or 1. No explanation.
    1 = Frame1 passes all checks (high quality, matches GT object presence)
    0 = Frame1 fails at least one check (object issue, distortion, or pop artifact)

    Model-level score. The OPS score for a mode...

  73. [73]

    Simple mean aggregation avoids biasing the score towards any particular phase of the manipulation

    Uniform temporal weighting: Object preservation is a frame-level property that should hold throughout the video, not just at the end. Simple mean aggregation avoids biasing the score towards any particular phase of the manipulation

  74. [74]

    The 0.70 confidence threshold requires a clear majority of frames to pass, catching videos where artefacts appear intermittently

    Binary label with conservative threshold: A single preserved/flawed dichotomy reflects the fact that object artefacts are both highly salient and disqualifying for downstream use. The 0.70 confidence threshold requires a clear majority of frames to pass, catching videos where artefacts appear intermittently
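One plausible reading of the aggregation described here, sketched under stated assumptions: the per-frame 0/1 judgments are averaged with uniform weights into a video-level confidence, and the video is labelled preserved when that confidence reaches 0.70. Whether the comparison is inclusive is not stated in the text, and the function name is hypothetical.

```python
def ops_video_label(frame_votes, threshold=0.70):
    """Return (confidence, label) for one video from per-frame 0/1 judgments.

    confidence: uniform mean of the frame votes.
    label: 1 (preserved) if confidence >= threshold, else 0 (flawed).
    The inclusive comparison is an assumption.
    """
    if not frame_votes:
        raise ValueError("frame_votes must not be empty")
    confidence = sum(frame_votes) / len(frame_votes)
    return confidence, int(confidence >= threshold)


# 12 of 16 frames pass: confidence 0.75, labelled preserved.
mostly_clean = ops_video_label([1] * 12 + [0] * 4)
# 11 of 16 frames pass: confidence 0.6875, labelled flawed.
intermittent = ops_video_label([1] * 11 + [0] * 5)
```

The second case shows the intended effect of the conservative threshold: a video can pass in a simple majority of frames and still be rejected.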

  75. [75]

    looks bad

    Object-focused prompt with three explicit checks: Decomposing the quality judgment into three distinct failure modes (visibility, deformation, pop artefacts) guides the VLM towards consistent and interpretable binary decisions, reducing false negatives that arise from a holistic “looks bad” judgment.

    D.2.3 Generalizability (GEN)

    GEN measures how well a...

  76. [76]

    Late-phase frame sampling: Frames are extracted at 81–97% progress because perturbation effects (e.g., premature release, grip slip) manifest in the transport and placement phases, not the approach phase
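A small sketch of late-phase sampling, assuming frames are picked at evenly spaced progress fractions between 0.81 and 0.97 of the video length. The number of sampled frames and the even spacing are illustrative assumptions; the source only gives the progress range.

```python
def late_phase_indices(num_frames, num_samples=4, start=0.81, end=0.97):
    """Return frame indices at evenly spaced progress fractions in [start, end]."""
    if num_frames < 1 or num_samples < 1:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        fractions = [end]
    else:
        step = (end - start) / (num_samples - 1)
        fractions = [start + k * step for k in range(num_samples)]
    last = num_frames - 1
    # Map each progress fraction to the nearest valid frame index.
    return [min(last, round(f * last)) for f in fractions]


# A 101-frame video, so that index i corresponds to i% progress.
indices = late_phase_indices(101)
```

For a 101-frame video the first and last sampled indices are 81 and 97, so every sampled frame falls in the transport and placement phases.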

  77. [77]

    Object-focused prompt: The prompt explicitly directs the model to focus on the manipulated object’s state rather than overall visual similarity, reducing false positives from rendering differences

  78. [78]

    Separate prompt for text-conditioned models: Text-conditioned models produce stylistically different videos even without perturbation, requiring the lenient prompt to avoid conflating style differences with perturbation effects

  79. [79]

    A ” (consistent) label and asks the VLM to compare shape, material, size, and colour. Let bobj be the number of pairs voted “B

    Zero-shot 78B model vs fine-tuned 8B: We find that a large zero-shot model (InternVL3-78B) outperforms the fine-tuned smaller model (Qwen3-VL-8B SFT) on this binary task, likely because the task requires spatial comparison rather than domain-specific scoring.

    E Physical Consistency Metric Calibration and Physics Law Compliance Details

    This appendix gathers...

  80. [80]

    SC-A1 Color Stability: Does the primary object maintain stable color throughout?

Showing first 80 references.