Recognition: 2 theorem links
Quantitative Video World Model Evaluation for Geometric Consistency
Pith reviewed 2026-05-15 03:06 UTC · model grok-4.3
The pith
PDI-Bench quantifies geometric coherence in generated videos by measuring projective residuals from 3D lifts of tracked points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a generated video clip, object-centric observations are obtained via segmentation and point tracking, then lifted to 3D world-space coordinates via monocular reconstruction; a set of projective-geometry residuals is computed to quantify three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. Across state-of-the-art generators this index reveals consistent geometry-specific failure modes invisible to common perceptual metrics and supplies a diagnostic signal for progress toward physically grounded video generation.
What carries the argument
The Perspective Distortion Index (PDI), which aggregates projective-geometry residuals computed on 3D world coordinates lifted from segmented and tracked points to measure scale-depth alignment, motion consistency, and structural rigidity.
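The 3D lift that these residuals are computed on can be sketched as a pinhole back-projection of tracked points. A minimal sketch, assuming a pinhole camera model; the intrinsics, array shapes, and function name are illustrative, not the paper's implementation:

```python
import numpy as np

def lift_to_world(tracks_2d, depth, f=500.0, cx=320.0, cy=240.0):
    """Back-project 2D tracks into 3D camera-frame coordinates.

    tracks_2d: (T, N, 2) pixel positions from a point tracker.
    depth:     (T, N) per-point depth from monocular reconstruction.
    f, cx, cy: assumed pinhole intrinsics (illustrative values).
    Returns (T, N, 3) points on which geometric residuals can be computed.
    """
    x = (tracks_2d[..., 0] - cx) * depth / f
    y = (tracks_2d[..., 1] - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)

# One tracked point, 500 px right of the principal point, at depth 2 m:
pts = lift_to_world(np.array([[[820.0, 240.0]]]), np.array([[2.0]]))
# back-projects to (2.0, 0.0, 2.0) in camera coordinates
```

In the benchmark itself the depth would come from MegaSaM pointmaps and the tracks from CoTracker3; the residuals are then functions of these lifted coordinates.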
If this is right
- Video generators can be ranked and improved by targeting measurable failures in scale consistency, motion trajectories, and rigidity instead of relying solely on visual appeal.
- Training loops gain an objective gradient signal for enforcing projective constraints that current perceptual losses do not provide.
- Evaluation of implicit world models shifts from subjective human ratings to repeatable 3D residual measurements across controlled datasets.
- Models that reduce PDI scores on the benchmark are expected to produce outputs more suitable for downstream tasks requiring spatial reasoning.
Where Pith is reading between the lines
- If PDI scores improve over time while perceptual metrics plateau, the field may be making genuine progress on physical plausibility even when human raters notice little change.
- PDI could be extended to multi-view or stereo video inputs to cross-validate the monocular reconstruction step and reduce its influence on the final score.
- Combining PDI with existing 2D metrics might yield a composite benchmark that better predicts performance in robotics simulation or planning applications.
Load-bearing premise
Monocular 3D reconstruction from the generated video produces accurate enough world-space coordinates to reveal the generator's own geometric errors rather than injecting reconstruction artifacts.
What would settle it
Generate videos with deliberately perfect 3D geometry using known camera paths and rigid objects, run the full PDI pipeline including monocular lift, and verify whether the index scores remain near zero; persistently high scores on perfect inputs would falsify the claim that PDI isolates generator errors.
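This falsification protocol can be sketched end-to-end on synthetic data: animate a rigid point cloud along a known path, score it with a rigidity residual, and confirm the score stays near zero until non-rigid perturbations are injected. The residual below (coefficient of variation of pairwise-distance ratios) is an assumed stand-in for the paper's exact formulation:

```python
import numpy as np

def rigidity_residual(frames):
    """Mean coefficient of variation of pairwise-distance ratios over time.
    frames: (T, N, 3) tracked 3D points; rigid motion keeps this near zero."""
    i, j = np.triu_indices(frames.shape[1], k=1)
    d = np.linalg.norm(frames[:, i] - frames[:, j], axis=-1)  # (T, n_pairs)
    r = d / d[0]                                              # ratios vs. frame 0
    return float((r.std(axis=1) / (r.mean(axis=1) + 1e-8)).mean())

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(20, 3))  # a rigid object as a point cloud

# Perfect geometry: the same cloud rotated and translated each frame.
rigid = np.stack([cloud @ rot_z(0.05 * t).T + [0.1 * t, 0.0, 0.0]
                  for t in range(10)])
# Generator-like failure: small non-rigid jitter on every point.
wobbly = rigid + rng.normal(scale=0.05, size=rigid.shape)

assert rigidity_residual(rigid) < 1e-6     # near zero on perfect input
assert rigidity_residual(wobbly) > 1e-3    # clearly elevated under jitter
```

A reconstruction step inserted between the synthetic frames and the residual, as in the full PDI pipeline, would then reveal how much error the monocular lift itself contributes.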
read the original abstract
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PDI-Bench, a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, it applies segmentation and point tracking (SAM 2, MegaSaM, CoTracker3), lifts observations to 3D world-space coordinates via monocular reconstruction, and computes projective-geometry residuals across three dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. It also releases PDI-Dataset covering diverse scenarios and reports that, across state-of-the-art video generators, PDI exposes geometry-specific failure modes not captured by common perceptual metrics, offering a diagnostic signal for physically grounded video generation.
Significance. If the residuals can be shown to be dominated by generator-induced geometric errors rather than upstream reconstruction artifacts, PDI-Bench would supply an objective, geometry-specific complement to existing perceptual and human-judgment metrics, directly supporting evaluation of video models as implicit world models.
major comments (2)
- [Abstract] Abstract and evaluation description: the central claim that PDI residuals diagnose generator failures requires evidence that monocular lifting (MegaSaM) produces sufficiently accurate 3D coordinates on generated video. No quantitative validation is supplied (e.g., reconstruction error on synthetic ground-truth video, an ablation swapping the reconstructor, or correlation with known geometric perturbations), leaving open the possibility that the residuals are confounded by reconstruction priors on the inconsistent lighting, texture, or motion patterns typical of generated content.
- [Methods] Methods / PDI-Dataset construction: the three residual definitions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) are derived directly from projective geometry applied to tracked points. Without an explicit isolation experiment or ground-truth comparison, it remains unclear whether the reported failure modes originate in the generator or are artifacts of the monocular pipeline.
minor comments (1)
- [Abstract] The abstract states that code and dataset are available at https://pdi-bench.github.io/; the manuscript should include a brief reproducibility checklist (exact versions of SAM 2, MegaSaM, CoTracker3, and any post-processing steps) to allow independent verification of the residual computations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need to validate the monocular reconstruction step and isolate generator effects in PDI-Bench. We address each major comment below and will incorporate additional experiments and clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation description: the central claim that PDI residuals diagnose generator failures requires evidence that monocular lifting (MegaSaM) produces sufficiently accurate 3D coordinates on generated video. No quantitative validation is supplied (e.g., reconstruction error on synthetic ground-truth video, an ablation swapping the reconstructor, or correlation with known geometric perturbations), leaving open the possibility that the residuals are confounded by reconstruction priors on the inconsistent lighting, texture, or motion patterns typical of generated content.
Authors: We agree that explicit validation of the monocular lifting on generated content is essential to support the central claim. While the manuscript employs established state-of-the-art methods (MegaSaM for reconstruction alongside SAM 2 and CoTracker3), we acknowledge the absence of dedicated quantitative checks in the current version. In the revision we will add a new validation subsection that (i) measures reconstruction error on synthetic ground-truth videos with known 3D geometry, (ii) performs an ablation by swapping the reconstructor, and (iii) correlates PDI residuals against controlled geometric perturbations injected into otherwise consistent clips. These experiments will demonstrate that the reported residuals are dominated by generator-induced inconsistencies rather than upstream reconstruction artifacts. revision: yes
-
Referee: [Methods] Methods / PDI-Dataset construction: the three residual definitions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) are derived directly from projective geometry applied to tracked points. Without an explicit isolation experiment or ground-truth comparison, it remains unclear whether the reported failure modes originate in the generator or are artifacts of the monocular pipeline.
Authors: The three residual definitions follow directly from projective geometry and are therefore independent of any particular reconstruction implementation. Nevertheless, we recognize the value of explicit isolation. In the revised manuscript we will include ground-truth comparison experiments using rendered videos that provide perfect 3D structure and motion; PDI scores will be computed both on the original renders and on versions with controlled generator-like perturbations. We will also report results across multiple reconstructors and trackers to confirm that the observed failure modes persist and are attributable to the video generators rather than the analysis pipeline. revision: yes
Circularity Check
No significant circularity in PDI derivation
full rationale
The paper defines PDI-Bench by lifting tracked points from generated video via external monocular reconstruction (MegaSaM, SAM 2, CoTracker3) then computing direct projective-geometry residuals on scale-depth alignment, 3D motion consistency, and structural rigidity. These residuals follow from standard projective constraints applied to the lifted coordinates; no equations, parameters, or self-citations reduce the reported values to quantities fitted on the same evaluation videos. The central claim therefore rests on an independent geometric calculation rather than tautological re-expression of inputs. This is the expected non-circular outcome for a metric constructed from first-principles geometry.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Monocular depth and point-tracking tools yield 3D world coordinates accurate enough for the purpose of measuring geometric inconsistency
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (echoes?)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear?)
UNCLEAR: the relation between this paper passage and the cited Recognition theorem is ambiguous.
h_t · Z_t = f · H = constant
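The quoted relation is the pinhole scale-depth invariant: a rigid object of true height H imaged with focal length f should satisfy h_t · Z_t = f · H in every frame, so the product h_t · Z_t is constant in a geometrically consistent clip. A minimal residual built on this invariant, as a sketch (the coefficient-of-variation form is an assumption, not necessarily the paper's ε_scale):

```python
import numpy as np

def scale_depth_residual(h, Z):
    """Coefficient of variation of h_t * Z_t; ~0 when scale and depth agree."""
    prod = np.asarray(h, dtype=float) * np.asarray(Z, dtype=float)
    return float(prod.std() / (prod.mean() + 1e-8))

f, H = 500.0, 2.0                     # assumed focal length (px) and height (m)
Z = np.linspace(4.0, 8.0, 10)         # object recedes from 4 m to 8 m
h = f * H / Z                         # pinhole projection keeps h * Z = f * H

assert scale_depth_residual(h, Z) < 1e-9   # consistent clip: product constant

# A generator that shrinks the object faster than its depth explains:
h_bad = h * np.linspace(1.0, 0.7, 10)
assert scale_depth_residual(h_bad, Z) > 0.05
```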
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3]
- [4]
-
[5]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. URL https://arxiv.org/abs/2311.15127
-
[6]
Doubao: A family of large language models
ByteDance. Doubao: A family of large language models. https://www.volcengine.com/product/doubao, 2026. Accessed: 2026-05-06
-
[7]
Seedance 2.0 Fast: High-efficiency video generation foundation model
ByteDance. Seedance 2.0 fast: High-efficiency video generation foundation model. https://www.doubao.com/, 2026. Accessed: 2026-04-19
- [8]
-
[9]
H. Duan, H.-X. Yu, S. Chen, L. Fei-Fei, and J. Wu. Worldscore: A unified evaluation benchmark for world generation, 2025. URL https://arxiv.org/abs/2504.00983
-
[10]
Flow: Where the next wave of storytelling happens
Google. Flow: Where the next wave of storytelling happens. https://labs.google/fx/tools/flow, 2026. Accessed: 2026-03-04
-
[11]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models, 2020. URL https://arxiv.org/abs/2006.11239
-
[12]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025. URL https://arxiv.org/abs/2506.08009
- [13]
-
[14]
Cotracker3: Simpler and better point tracking by pseudo-labelling real videos
N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024. URL https://arxiv.org/abs/2410.11831
-
[15]
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao...
-
[16]
URL https://arxiv.org/abs/2412.03603
- [17]
- [18]
-
[19]
Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, L. He, and L. Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models,
-
[20]
URL https://arxiv.org/abs/2402.17177
- [21]
-
[22]
Sora: Creating video from text
OpenAI. Sora: Creating video from text. https://openai.com/sora, 2025. Accessed: 2026-03-20
-
[23]
Learning Transferable Visual Models From Natural Language Supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020
-
[24]
N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714
-
[25]
Improved Techniques for Training GANs
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans, 2016. URL https://arxiv.org/abs/1606.03498
- [26]
-
[27]
Towards Accurate Generative Models of Video: A New Metric & Challenges
T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. URL https://arxiv.org/abs/1812.01717
-
[28]
R. Upadhyay, H. Zhang, J. Solomon, A. Agrawal, P. Boreddy, S. S. Narayana, Y. Ba, A. Wong, C. M. de Melo, and A. Kadambi. Worldbench: Disambiguating physics for diagnostic evaluation of world models, 2026. URL https://arxiv.org/abs/2601.21282
-
[29]
T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....
-
[30]
Wan-Video. Wan2.2: Wan: Open and advanced large-scale video generative models. https://github.com/Wan-Video/Wan2.2, 2025. GitHub repository
-
[31]
B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks, 2023. URL https://arxiv.org/abs/2311.06242
-
[32]
Cogvideox-3: Text-to-video diffusion models
Zhipu AI. Cogvideox-3: Text-to-video diffusion models. https://chatglm.cn/video, 2026. Accessed: 2026-04-18.
A. Additional Experimental Details
A.1. PDI-Dataset Construction
The PDI-Dataset consists of 183 video sequences in total, partitioned into real-world and synthetic subsets. Real-world sequences. The real-world portion of PDI-Dataset contains 15 sh...
-
[33]
All synthetic videos presented in our benchmark reflect the baseline commercial performance available to end-users at the time of evaluation. Note that the Sora samples in our dataset were generated using the $20 monthly consumer subscription rather than the enterprise API, representing the baseline commercial performance of the model. The 28 text prompts...
-
[34]
A handheld following shot of a red vintage car driving away on a straight desert highway, harsh noon light and heat haze on the horizon, subtle shake and lateral drift
-
[35]
A high-speed train moving toward the viewer on a straight track, low-angle handheld perspective, rails and gravel receding toward a clear vanishing point
-
[36]
A yellow school bus driving away on a straight tree-lined suburban street, the shot tracking from a low position behind, morning light and clean asphalt
-
[37]
A silver metallic sphere rolling away on a long reflective marble floor in a bright gallery, the shot following closely with slight sway
-
[38]
A heavy cargo truck moving away on a straight bridge at night, tail lights glowing, subtle frame shake, city lights in the distance
-
[39]
A large shipping container being pushed away on a straight industrial dock, cranes and water behind, moving viewpoint, overcast industrial light.
Dynamic Tracking
-
[40]
A handheld following shot of a red sports car driving on a straight multi-lane highway, city skyline and roadside trees in the background receding rapidly with parallax
-
[41]
A smooth following shot of an autonomous suitcase moving through a vast airport terminal, repeated columns and floor patterns rushing past in frame
-
[42]
A close handheld shot following a large chrome sphere rolling along a straight, reflective museum corridor, exhibits and windows flowing past
-
[43]
A following shot from a vehicle alongside, keeping pace with a large truck carrying a blue container on a long bridge, waves and bridge cables creating dynamic background motion
-
[44]
A smooth following shot of a metal logistics crate moving along a straight automated conveyor, complex factory machinery in the background rushing past
-
[45]
A handheld following shot of a large metal ball rolling through a straight modern art gallery, surrounding artworks and viewers receding rapidly with parallax.
Biological Motion
-
[46]
A smooth following shot of a large eagle flying at high speed parallel to a cliff, rock face and sea below, clear sky
-
[47]
A following shot from a moving boat of a dolphin swimming and leaping in the waves alongside, spray and sunlight
-
[48]
A handheld shot of a large octopus swimming away in a complex coral reef, tentacles waving, colorful fish and coral, blue water and light shafts
-
[49]
A backward-moving shot following a snake slithering through dense colorful flowers on the ground, petals and stems, soft daylight
-
[50]
A moving shot following a peacock walking and shaking its tail feathers in a palace garden, fountains and trimmed hedges, ornate tiles.
Curved Motion
-
[51]
A handheld tracking perspective follows a silver compact SUV navigating a sharp hairpin turn on a winding mountain road. The view orbits slightly to capture the vehicle transitioning from a front-view to a side-view against the pine forest background
-
[52]
A low-angle shot follows a sports car drifting through a 90-degree corner on a professional race track. The car rotates intensely while the moving shot emphasizes the shifting vanishing lines of the curb and tire marks
-
[53]
A cinematic tracking shot follows a city bus driving through a large, ornate stone round- about. The view maintains a side perspective, showing the bus constantly changing its orientation relative to the central fountain and surrounding city traffic
-
[54]
A ground-level perspective tracking a small delivery robot as it makes a sharp turn at a sidewalk corner. The shot stays close, highlighting the rotation of the robot’s boxy frame against the detailed brickwork
-
[55]
A handheld shot follows a green tractor making a wide turn at the edge of a plowed field. The view moves with the vehicle, capturing the shifting angles of the heavy wheels and mechanical parts against the vast landscape.
Partial Occlusion
-
[56]
A car driving along a street at night, wheels briefly obscured by a low roadside guardrail for under a second, handheld shot moving alongside, street lamps and storefronts
-
[57]
A train passing behind a row of thin vertical power line poles, the shot tracking its movement from a moving platform, sky and industrial landscape
-
[58]
A bus moving through a city street, briefly partially hidden by a thin traffic sign, the shot following from the sidewalk
-
[59]
A vintage car driving past a row of thin trees, never fully leaving the moving view, autumn leaves and road
-
[60]
A boat sailing behind a thin pier support, remaining partially visible throughout, handheld shot from the dock, sea and sky.
-
[61]
A robot crate moving through a warehouse, passing behind a thin metal rack, the shot following alongside, shelves and boxes, industrial lighting.
Reconstruction-aware weighting. The final PDI score is synthesized as a weighted sum of three orthogonal physical residuals:
PDI = w_1 · RMSE(ε_scale) + w_2 · RMSE(ε_traj) + w_3 · ε_rigidity,  (11)
where Σ_i w_i = 1. ...
-
[62]
3D Pairwise Rigidity (primary). We sample world-space points q_t^n from MegaSaM pointmaps at CoTracker locations. Anchor pairs are selected at t = 0 by triple filtering: (i) visibility filtering, (ii) depth-gradient reliability filtering, and (iii) pair scoring that favors both large 3D separation and interior-region reliability (distance to the mask boundary). Fo...
-
[63]
3D Height Stability (fallback when Strategy 1 is not entered). If 3D points are valid but Strategy 1 is unavailable at the dispatcher level, we compute the per-frame 3D object height from the foreground y-span, h3D_t = P_95(y_t) − P_5(y_t), and use its coefficient of variation:
ε_rigid^(2) = std({h3D_t}_{t=1..T}) / (mean({h3D_t}_{t=1..T}) + ε).
-
[64]
2D Pairwise Consistency (degraded fallback). When 3D evidence is unavailable, we use 2D CoTracker pairwise distance ratios:
r2D_ij(t) = d_ij(t) / d_ij(0),  ρ2D_t = std({r2D_ij(t)}) / (mean({r2D_ij(t)}) + ε),
and compute ε_rigid^(3) = (1/T) Σ_{t=1..T} ρ2D_t. Finally, the rigidity component used by PDI is ε_rigid = ε_rigid^(1) if Strategy 1 is sel...
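Given the three residuals, the aggregation in Eq. (11) reduces to a weighted sum. A minimal sketch; the equal default weights and the array-versus-scalar conventions are illustrative assumptions, not the benchmark's released configuration:

```python
import numpy as np

def pdi_score(eps_scale, eps_traj, eps_rigid, w=(1/3, 1/3, 1/3)):
    """PDI = w1*RMSE(eps_scale) + w2*RMSE(eps_traj) + w3*eps_rigid, as in Eq. (11).

    eps_scale, eps_traj: per-frame residual arrays; eps_rigid: scalar
    (already a time-averaged coefficient of variation). Weights sum to 1.
    """
    w = np.asarray(w, dtype=float)
    assert np.isclose(w.sum(), 1.0), "weights must sum to one"
    rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
    return w[0] * rmse(eps_scale) + w[1] * rmse(eps_traj) + w[2] * float(eps_rigid)

# A geometrically perfect clip scores zero...
assert pdi_score(np.zeros(10), np.zeros(10), 0.0) == 0.0
# ...and any residual error raises the index.
assert pdi_score(np.full(10, 0.3), np.zeros(10), 0.0) > 0.0
```

Lower scores indicate better geometric coherence under this formulation, with zero attained only when all three residuals vanish.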