arxiv: 2606.31672 · v1 · pith:KWMJVSH3new · submitted 2026-06-30 · 💻 cs.CV · cs.AI

WorldRoamBench: An Open-World Benchmark for Long-Horizon Stability of Interactive World Models

Pith reviewed 2026-07-01 05:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords interactive world models · long-horizon stability · benchmark · action following · vision drift · physics consistency · memory evaluation · open-world scenes

The pith

WorldRoamBench shows no interactive world model reliably satisfies all four long-horizon stability dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WorldRoamBench to evaluate interactive world models on long-horizon stability across action, vision, physics, and memory. It replaces trajectory-level checks with four metrics, each targeting a gap in prior evaluation: per-frame action errors hidden by trajectory-level scores, mid-sequence visual collapse missed by start-vs-end comparisons, physical plausibility that is scored only when the action was faithfully executed, and memory evaluated independently of action-following errors. Testing more than ten open- and closed-source models on over six hundred cases across nature, urban, and indoor scenes finds that none reliably meets all four criteria and that the strongest reach only moderate scores. A sympathetic reader would care because these models are intended for extended real-world interaction, yet current ones lose consistency over time.

Core claim

WorldRoamBench is an open-world benchmark comprising more than six hundred test cases in nature, urban, and indoor scenes that evaluates interactive world models on long-horizon stability using a per-frame action metric, a segment-based vision drift metric, controllability-gated physics evaluation, and an action-decoupled memory protocol; results show that none of the ten-plus tested models reliably satisfies all four dimensions and even the best achieves only moderate scores.
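As a rough illustration of why per-frame scoring is stricter than a single trajectory-level check, the sketch below compares the action recovered at each frame with the prescribed one. Representing an action as a set of pressed keys, and the definitions of `strict_match` and `partial_match`, are assumptions made for this example; the paper's own discretization and scoring rules may differ.

```python
def strict_match(pred, gt):
    """True only if the recovered key set equals the prescribed key set."""
    return pred == gt


def partial_match(pred, gt):
    """Fraction of prescribed keys that were recovered (assumed definition)."""
    if not gt:
        return 1.0 if not pred else 0.0
    return len(pred & gt) / len(gt)


def per_frame_accuracy(pred_frames, gt_frames):
    """Return (strict, partial) accuracy averaged over all frames."""
    if len(pred_frames) != len(gt_frames) or not gt_frames:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(gt_frames)
    pairs = list(zip(pred_frames, gt_frames))
    strict = sum(strict_match(p, g) for p, g in pairs) / n
    partial = sum(partial_match(p, g) for p, g in pairs) / n
    return strict, partial


# Toy rollout (made-up data): the prescribed action is W+A on every frame,
# but the model only executes both keys on the first frame.
gt_frames = [{"W", "A"}] * 4
pred_frames = [{"W", "A"}, {"W"}, {"W"}, {"A"}]
strict, partial = per_frame_accuracy(pred_frames, gt_frames)
print(strict, partial)  # 0.25 0.625
```

In this toy case the camera still ends up moving roughly forward and left overall, which a coarse trajectory comparison might accept, while the per-frame view shows that the compound action was executed on only one frame in four.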

What carries the argument

The four tailored metrics (per-frame action, segment-based vision drift, controllability-gated physics, and action-decoupled memory) that replace simple trajectory-level evaluation.
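The figure captions describe the drift score as the quality drop from the best to the worst segment of a rollout. A minimal sketch of that idea, contrasted with a start-vs-end comparison, is given below. The equal-length segmentation, the default of four segments, and all function names are assumptions for illustration, not the paper's settings.

```python
from statistics import mean


def segment_means(scores, num_segments):
    """Split per-frame scores into contiguous segments and average each one."""
    if num_segments < 1 or num_segments > len(scores):
        raise ValueError("num_segments must be between 1 and len(scores)")
    n = len(scores)
    # Integer boundaries keep segments contiguous and nearly equal in length.
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [mean(scores[a:b]) for a, b in zip(bounds, bounds[1:])]


def segment_drift(scores, num_segments=4):
    """Best-segment mean minus worst-segment mean (lower = more stable)."""
    means = segment_means(scores, num_segments)
    return max(means) - min(means)


def start_vs_end_drift(scores, num_segments=4):
    """Baseline: first-segment mean minus last-segment mean."""
    means = segment_means(scores, num_segments)
    return means[0] - means[-1]


# Toy per-frame quality curve (made-up data): collapse mid-sequence, then recovery.
quality = [0.9] * 10 + [0.4] * 10 + [0.5] * 10 + [0.9] * 10
print(segment_drift(quality), start_vs_end_drift(quality))
```

On this toy curve the start-vs-end baseline reports essentially no drift because the rollout recovers by the final segment, while the segment-based score reports a drop of about 0.5, which is the non-monotonic failure the benchmark is designed to expose.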

If this is right

  • Action evaluation must occur per frame rather than at the full trajectory level to expose hidden failures.
  • Vision assessment must check for non-monotonic drift in the middle of sequences rather than only start versus end.
  • Physics scoring must be gated on faithful action execution across mechanics, optics, and 3D consistency.
  • Memory must be tested separately from actions using localized 3D reconstruction for scenes and tracking plus reasoning for subjects.
  • Advances require simultaneous improvement on all four dimensions for models to become stable and deployable.
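For the scene-memory track, the summary says that reconstructed geometry from the observation segment is compared with geometry from the revisit segment. One common way to compare two point clouds is a symmetric Chamfer distance, sketched below in plain Python. Whether the paper scores memory with exactly this distance is an assumption, and the sketch also assumes the two clouds have already been registered into a common frame.

```python
from math import dist


def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(dist(point, q) for q in cloud)


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance both ways."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(nearest_distance(p, cloud_b) for p in cloud_a) / len(cloud_a)
    b_to_a = sum(nearest_distance(q, cloud_a) for q in cloud_b) / len(cloud_b)
    return a_to_b + b_to_a


# Toy clouds (made-up data): the revisit is either identical to the
# observation or displaced by 0.5 along z.
observation = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
revisit_same = list(observation)
revisit_shifted = [(x, y, z + 0.5) for x, y, z in observation]

print(chamfer_distance(observation, revisit_same))     # 0.0
print(chamfer_distance(observation, revisit_shifted))  # 1.0
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a sketch but would need a spatial index for real reconstructions.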

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark's design could be applied to measure progress after targeted training that adds explicit memory or physics losses.
  • Real deployment in robotics may still need extra tests for sensor noise or multi-agent interaction not covered here.
  • The gap between open and closed models on specific dimensions could point to architectural choices worth investigating further.
  • Extending the test length beyond sixty seconds might reveal additional failure modes in current models.

Load-bearing premise

The four tailored metrics accurately capture long-horizon stability without introducing model-specific biases or evaluation artifacts.

What would settle it

A model that scores high on all four metrics but still exhibits clear instability during extended continuous interaction in new scenes would challenge the metrics.

Figures

Figures reproduced from arXiv: 2606.31672 by Baoquan Chen, Fan Jiang, Hongyu Pan, Jiacheng Sui, Kewei Shi, Mingchao Sun, Mu Xu, Qi Fan, Ting-Bing Xu, Wenjin Yang, Yong Li, Zhaoxu Sun, Zhe Gao, Zhicheng Liu.

Figure 1: Illustration of WorldRoamBench. WorldRoamBench evaluates interactive world models across action following, visual quality, memory, and interaction physics, covering open-source and closed-source models in diverse scenarios and viewing conditions.
Figure 2: Failures revealed by long-horizon interaction. Extended rollouts expose failures often missed by short clips: high trajectory scores hide per-step action mismatches, visual quality degrades, physical constraints are violated, and revisited scenes are regenerated inconsistently.
Figure 3: Overview of the WorldRoamBench evaluation pipeline. Given videos generated from shared initial frames and WASD/IJKL action programs, WorldRoamBench evaluates four complementary dimensions: action following through pose-aligned frame scoring, visual quality through frame-level and drift-aware metrics, interaction physics through mechanics, optics, and 3D-consistency tests, and memory through turning-point-a…
Figure 4: Action following evaluation pipeline. A shared ViPE trajectory estimation stage feeds two branches: (left) per-frame action accuracy via latent-stride discretization, and (right) TrajScore via adaptive GT construction and arc-length resampling.
Figure 5: Memory evaluation pipeline. WorldRoamBench evaluates memory with two trajectory-aware tracks. Scene memory localizes the executed observation–revisit transition and compares reconstructed scene geometry across the two segments. Subject memory tracks the third-person protagonist and evaluates identity, structure, and appearance preservation with a holistic Qwen3-VL-PLUS judgment. The full pipeline is provi…
Figure 6: Action key distribution across the three evaluation dimensions. Each pie chart shows the proportion and count of per-frame key presses aggregated over all test cases in that dimension.
Figure 7: Test case count and action-sequence length distribution. (a) Test cases by dimension, split by first- and third-person perspective. (b) Distribution of action-sequence lengths in frames.
Figure 9: Per-frame visual quality curves and drift comparison (first-person). Top: mean imaging and aesthetic scores at each frame index (averaged over 120 videos per model, first 300 frames). Bottom: drift scores quantifying quality degradation from the best to the worst segment of the rollout (lower = more stable); bar height is the mean drift across videos, and black error bars span from mean − 1 std (lower cap) t…
Figure 10: Per-frame action accuracy by difficulty (first-person). Strict accuracy (left) and partial accuracy (right) grouped by difficulty level (easy = constant action, medium = 1 action switch, hard = 2 action switches). Bar height is the mean across test cases. Most models degrade from easy to hard, but the magnitude of degradation varies substantially.
Figure 11: Test suite gallery. Each panel shows a representative test case with the first frame, trajectory schematic, and action sequence. The gallery is organized by scene category from top to bottom (Indoor, Urban, Nature) and by perspective from left to right (first-person, third-person), yielding six panels.
Figure 12: Closed-source automated interaction pipeline for Happy Oyster and Genie 3. The system reads benchmark test cases (action …
Figure 13: Latent-stride single-frame discretization.
Figure 15: Detailed interaction-physics evaluation pipeline. The full pipeline includes action-responsiveness validation, mechanics protocols, optics protocols, 3D-consistency evaluation, Qwen3-VL-PLUS-based VLM scoring, fallback queries, and domain-level aggregation.
Figure 16: Detailed memory evaluation pipeline. The full pipeline includes action-aware transition localization, point-cloud reconstruction, registration, filtering, scene-memory scoring, subject tracking, controllability gating, and Qwen3-VL-PLUS-based VLM subject-memory evaluation.
Figure 17: Failure mode of frame-pair memory evaluation. Frame-pair metrics assume that an observation frame and its revisit counterpart depict the same spatial location. In long-horizon interactive rollouts, however, imperfect action execution can shift the revisit trajectory, so the paired frames correspond to different viewpoints or even different scene regions. Image-level discrepancies in such pairs therefore…
Figure 18: Action gap in closed-source models. Top: Genie 3. When the D key (rightward translation) is issued, the model rotates the camera to the right instead of translating the character rightward within a stable scene. The character remains centered while the entire scene rotates, making controlled navigation impossible. Bottom: Happy Oyster. When the D key (rightward translation) is issued, the model faithfully…
Figure 19: Qualitative examples of WorldRoamBench scoring across physics, memory, and action. Each block contrasts a positive and a negative rollout that share the same initial frame and action schedule. Physics: W+A in a corridor, Happy Oyster (No Clipping) vs. Genie 3 (Clipping). Memory: J→L Observation/Revisit in an alley, Matrix-Game 3.0 (Good Memory) vs. SANA-WM (Bad Memory). Action: S→A (backward→left), Genie 3 (…
Figure 20: Collision evaluation prompts. Full VLM prompts used for approach detection and collision-response verification.
Figure 21: Deformation evaluation prompts. Full VLM prompts used for first-person trace detection, third-person reference comparison, and third-person real-time interaction checks.
Figure 22: Clipping evaluation prompts. Full VLM prompts used by the staged clipping-detection cascade.
Figure 23: Gravity evaluation prompts. Full VLM prompts used for scenario-specific gravity checks and the third-person general gravity check.
Figure 24: Terrain-following evaluation prompts. Full VLM prompts used for naturalness, third-person ground contact, and boundary checks.
Figure 25: Reflection evaluation prompts. Full VLM prompts used for coarse reflection plausibility checks and per-frame reflection scoring.
Figure 26: Occlusion and shadow evaluation prompts. Full VLM prompts used for expected-shadow, fallback shadow, and generic shadow checks.
Figure 27: Subject memory evaluation prompt. Full VLM prompt used for holistic third-person subject-memory scoring and diagnostic flag extraction.
Original abstract

Despite rapid progress in interactive world models (IWMs), existing benchmarks evaluate action following only at trajectory level and ignore memory and interaction physics. We introduce WorldRoamBench, an open-world benchmark for long-horizon stability across four dimensions, each with tailored innovations: (i) Action: per-frame action metric bypassing cross-model semantic scale disparity and exposing failures hidden by trajectory; (ii) Vision: segment-based drift metric capturing non-monotonic mid-sequence collapse missed by start-vs-end comparisons; (iii) Physics: controllability-gated evaluation over mechanics, optics, and 3D consistency, scoring plausibility under faithful action execution; (iv) Memory: action-decoupled protocol evaluating scene memory via transition-localized 3D point-cloud reconstruction and subject memory via tracking-plus-VLM reasoning. The benchmark comprises 600+ test cases across Nature, Urban, and Indoor scenes in first/third-person views with WASD 10-60s continuous interaction. Evaluating 10+ open/closed-source models reveals none reliably satisfies all dimensions; even the best achieves only moderate scores. Advances on WorldRoamBench are steps toward IWMs that are stable, physically grounded, memory-faithful, and deployable in real-world applications.
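The controllability-gated physics evaluation can be pictured as a two-step filter: a case contributes a physics score only if the model actually responded to the prescribed action. The sketch below shows that gating logic under stated assumptions. The `PhysicsCase` fields, the 0.5 threshold, and the choice to exclude gated-out cases (rather than score them as zero) are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhysicsCase:
    case_id: str
    action_responsiveness: float  # hypothetical 0..1 score from the gate
    plausibility: float           # hypothetical 0..1 physics judgment


def gated_physics_score(cases: List[PhysicsCase],
                        gate_threshold: float = 0.5) -> Optional[float]:
    """Mean plausibility over cases that pass the controllability gate.

    Returns None when no case passes, so that an unresponsive model is
    reported as "not evaluable" rather than as physically plausible.
    """
    passed = [c for c in cases if c.action_responsiveness >= gate_threshold]
    if not passed:
        return None
    return sum(c.plausibility for c in passed) / len(passed)


# Toy cases (made-up data): case "b" looks plausible only because the model
# stayed static, so the gate removes it from the average.
cases = [
    PhysicsCase("a", action_responsiveness=0.9, plausibility=0.8),
    PhysicsCase("b", action_responsiveness=0.2, plausibility=1.0),
    PhysicsCase("c", action_responsiveness=0.7, plausibility=0.4),
]
print(gated_physics_score(cases))
```

Without the gate the toy average would be pulled upward by the static rollout, which is the conflation between action-following failures and physical plausibility that the gating is meant to prevent.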

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces WorldRoamBench, an open-world benchmark for assessing long-horizon stability of interactive world models across four dimensions (action, vision, physics, memory), each with custom metrics: per-frame action evaluation, segment-based vision drift, controllability-gated physics plausibility, and action-decoupled memory via 3D reconstruction and VLM reasoning. It describes 600+ test cases in varied scenes and reports that evaluation of 10+ models shows none reliably satisfy all dimensions, with even the best achieving only moderate scores.

Significance. If the four metrics are shown to be robust and free of evaluation artifacts, the benchmark would address a clear gap in existing trajectory-level evaluations by emphasizing memory fidelity and interaction physics over long horizons, potentially serving as a useful standard for guiding IWM development toward real-world deployability.

major comments (1)
  1. The central empirical claim that no model satisfies all four dimensions rests on the reliability of the four tailored metrics, yet the manuscript supplies no validation of these metrics (e.g., no human correlation studies, ablation on metric components, or error analysis), which is load-bearing for interpreting the reported model rankings and the conclusion that advances on the benchmark are steps toward stable IWMs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of metric validation. We address the concern directly below and commit to revisions that strengthen the empirical support for the benchmark.

Point-by-point responses
  1. Referee: [—] The central empirical claim that no model satisfies all four dimensions rests on the reliability of the four tailored metrics, yet the manuscript supplies no validation of these metrics (e.g., no human correlation studies, ablation on metric components, or error analysis), which is load-bearing for interpreting the reported model rankings and the conclusion that advances on the benchmark are steps toward stable IWMs.

    Authors: We agree that the absence of explicit validation studies (human correlation, ablations, or error analysis) is a limitation in the current manuscript. Each metric is motivated in the text by concrete shortcomings of prior trajectory-level evaluations (per-frame action to avoid semantic scale issues; segment-based vision to capture non-monotonic drift; controllability-gated physics; action-decoupled memory via 3D reconstruction and VLM). However, these motivations are design-based rather than empirically validated against human judgments or alternative formulations. We will add a new subsection in the revised version containing: (i) human correlation studies on a sampled subset of vision and memory cases, (ii) ablation of key metric components (e.g., segment length, gating threshold), and (iii) error analysis of failure modes. These additions will directly support the reported rankings and the broader claim.

    revision: yes
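The rebuttal promises human correlation studies. One conventional way to report such a study is a rank correlation between automatic metric scores and human ratings on the same cases. The sketch below computes Spearman's rho in plain Python; the choice of Spearman, the function names, and the sample values are assumptions for illustration, and the numbers are made up rather than taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: metric scores and human ratings for four rollouts.
metric_scores = [0.10, 0.40, 0.35, 0.80]
human_ratings = [1, 3, 2, 5]
print(spearman(metric_scores, human_ratings))
```

A rank correlation is used here because metric scores and human ratings need not share a scale; only the ordering of cases has to agree.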

Circularity Check

0 steps flagged

No significant circularity

This is an empirical benchmark paper that defines four new evaluation metrics (per-frame action, segment-based vision drift, controllability-gated physics, action-decoupled memory) and applies them to 10+ models across 600+ test cases. No equations, derivations, fitted parameters, or self-citations are present that reduce any reported result or claim to an input by construction. The metric definitions and protocols are stated directly and independently of the model outputs they measure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an evaluation benchmark rather than a theoretical derivation; no free parameters, axioms, or invented entities are required to support the central claim.

pith-pipeline@v0.9.1-grok · 5797 in / 1296 out tokens · 39363 ms · 2026-07-01T05:58:44.438075+00:00 · methodology

