Recognition: 2 theorem links
Quantitative Video World Model Evaluation for Geometric Consistency
Pith reviewed 2026-05-15 03:06 UTC · model grok-4.3
The pith
PDI-Bench quantifies geometric coherence in generated videos by measuring projective residuals from 3D lifts of tracked points.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a generated video clip, object-centric observations are obtained via segmentation and point tracking, then lifted to 3D world-space coordinates via monocular reconstruction; a set of projective-geometry residuals is computed to quantify three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. Across state-of-the-art generators this index reveals consistent geometry-specific failure modes invisible to common perceptual metrics and supplies a diagnostic signal for progress toward physically grounded video generation.
What carries the argument
The Perspective Distortion Index (PDI), which aggregates projective-geometry residuals computed on 3D world coordinates lifted from segmented and tracked points to measure scale-depth alignment, motion consistency, and structural rigidity.
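The 3D lift that these residuals are computed on can be sketched as a pinhole back-projection of tracked points. A minimal sketch, assuming a pinhole camera model; the intrinsics, array shapes, and function name are illustrative, not the paper's implementation:

```python
import numpy as np

def lift_to_world(tracks_2d, depth, f=500.0, cx=320.0, cy=240.0):
    """Back-project 2D tracks into 3D camera-frame coordinates.

    tracks_2d: (T, N, 2) pixel positions from a point tracker.
    depth:     (T, N) per-point depth from monocular reconstruction.
    f, cx, cy: assumed pinhole intrinsics (illustrative values).
    Returns (T, N, 3) points on which geometric residuals can be computed.
    """
    x = (tracks_2d[..., 0] - cx) * depth / f
    y = (tracks_2d[..., 1] - cy) * depth / f
    return np.stack([x, y, depth], axis=-1)

# One tracked point, 500 px right of the principal point, at depth 2 m:
pts = lift_to_world(np.array([[[820.0, 240.0]]]), np.array([[2.0]]))
# back-projects to (2.0, 0.0, 2.0) in camera coordinates
```

In the benchmark itself the depth would come from MegaSaM pointmaps and the tracks from CoTracker3; the residuals are then functions of these lifted coordinates.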
If this is right
- Video generators can be ranked and improved by targeting measurable failures in scale consistency, motion trajectories, and rigidity instead of relying solely on visual appeal.
- Training loops gain an objective gradient signal for enforcing projective constraints that current perceptual losses do not provide.
- Evaluation of implicit world models shifts from subjective human ratings to repeatable 3D residual measurements across controlled datasets.
- Models that reduce PDI scores on the benchmark are expected to produce outputs more suitable for downstream tasks requiring spatial reasoning.
Where Pith is reading between the lines
- If PDI scores improve over time while perceptual metrics plateau, the field may be making genuine progress on physical plausibility even when human raters notice little change.
- PDI could be extended to multi-view or stereo video inputs to cross-validate the monocular reconstruction step and reduce its influence on the final score.
- Combining PDI with existing 2D metrics might yield a composite benchmark that better predicts performance in robotics simulation or planning applications.
Load-bearing premise
Monocular 3D reconstruction from the generated video produces accurate enough world-space coordinates to reveal the generator's own geometric errors rather than injecting reconstruction artifacts.
What would settle it
Generate videos with deliberately perfect 3D geometry using known camera paths and rigid objects, run the full PDI pipeline including monocular lift, and verify whether the index scores remain near zero; persistently high scores on perfect inputs would falsify the claim that PDI isolates generator errors.
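This falsification protocol can be sketched end-to-end on synthetic data: animate a rigid point cloud along a known path, score it with a rigidity residual, and confirm the score stays near zero until non-rigid perturbations are injected. The residual below (coefficient of variation of pairwise-distance ratios) is an assumed stand-in for the paper's exact formulation:

```python
import numpy as np

def rigidity_residual(frames):
    """Mean coefficient of variation of pairwise-distance ratios over time.
    frames: (T, N, 3) tracked 3D points; rigid motion keeps this near zero."""
    i, j = np.triu_indices(frames.shape[1], k=1)
    d = np.linalg.norm(frames[:, i] - frames[:, j], axis=-1)  # (T, n_pairs)
    r = d / d[0]                                              # ratios vs. frame 0
    return float((r.std(axis=1) / (r.mean(axis=1) + 1e-8)).mean())

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(20, 3))  # a rigid object as a point cloud

# Perfect geometry: the same cloud rotated and translated each frame.
rigid = np.stack([cloud @ rot_z(0.05 * t).T + [0.1 * t, 0.0, 0.0]
                  for t in range(10)])
# Generator-like failure: small non-rigid jitter on every point.
wobbly = rigid + rng.normal(scale=0.05, size=rigid.shape)

assert rigidity_residual(rigid) < 1e-6     # near zero on perfect input
assert rigidity_residual(wobbly) > 1e-3    # clearly elevated under jitter
```

A reconstruction step inserted between the synthetic frames and the residual, as in the full PDI pipeline, would then reveal how much error the monocular lift itself contributes.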
read the original abstract
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PDI-Bench, a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, it applies segmentation and point tracking (SAM 2, MegaSaM, CoTracker3), lifts observations to 3D world-space coordinates via monocular reconstruction, and computes projective-geometry residuals across three dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. It also releases PDI-Dataset covering diverse scenarios and reports that, across state-of-the-art video generators, PDI exposes geometry-specific failure modes not captured by common perceptual metrics, offering a diagnostic signal for physically grounded video generation.
Significance. If the residuals can be shown to be dominated by generator-induced geometric errors rather than upstream reconstruction artifacts, PDI-Bench would supply an objective, geometry-specific complement to existing perceptual and human-judgment metrics, directly supporting evaluation of video models as implicit world models.
major comments (2)
- [Abstract] Abstract and evaluation description: the central claim that PDI residuals diagnose generator failures requires evidence that monocular lifting (MegaSaM) produces sufficiently accurate 3D coordinates on generated video. No quantitative validation is supplied (e.g., reconstruction error on synthetic ground-truth video, an ablation swapping the reconstructor, or correlation with known geometric perturbations), leaving open the possibility that the residuals are confounded by reconstruction priors on the inconsistent lighting, texture, or motion patterns typical of generated content.
- [Methods] Methods / PDI-Dataset construction: the three residual definitions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) are derived directly from projective geometry applied to tracked points. Without an explicit isolation experiment or ground-truth comparison, it remains unclear whether the reported failure modes originate in the generator or are artifacts of the monocular pipeline.
minor comments (1)
- [Abstract] The abstract states that code and dataset are available at https://pdi-bench.github.io/; the manuscript should include a brief reproducibility checklist (exact versions of SAM 2, MegaSaM, CoTracker3, and any post-processing steps) to allow independent verification of the residual computations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the need to validate the monocular reconstruction step and isolate generator effects in PDI-Bench. We address each major comment below and will incorporate additional experiments and clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation description: the central claim that PDI residuals diagnose generator failures requires evidence that monocular lifting (MegaSaM) produces sufficiently accurate 3D coordinates on generated video. No quantitative validation is supplied (e.g., reconstruction error on synthetic ground-truth video, an ablation swapping the reconstructor, or correlation with known geometric perturbations), leaving open the possibility that the residuals are confounded by reconstruction priors on the inconsistent lighting, texture, or motion patterns typical of generated content.
Authors: We agree that explicit validation of the monocular lifting on generated content is essential to support the central claim. While the manuscript employs established state-of-the-art methods (MegaSaM for reconstruction alongside SAM 2 and CoTracker3), we acknowledge the absence of dedicated quantitative checks in the current version. In the revision we will add a new validation subsection that (i) measures reconstruction error on synthetic ground-truth videos with known 3D geometry, (ii) performs an ablation by swapping the reconstructor, and (iii) correlates PDI residuals against controlled geometric perturbations injected into otherwise consistent clips. These experiments will demonstrate that the reported residuals are dominated by generator-induced inconsistencies rather than upstream reconstruction artifacts. revision: yes
-
Referee: [Methods] Methods / PDI-Dataset construction: the three residual definitions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) are derived directly from projective geometry applied to tracked points. Without an explicit isolation experiment or ground-truth comparison, it remains unclear whether the reported failure modes originate in the generator or are artifacts of the monocular pipeline.
Authors: The three residual definitions follow directly from projective geometry and are therefore independent of any particular reconstruction implementation. Nevertheless, we recognize the value of explicit isolation. In the revised manuscript we will include ground-truth comparison experiments using rendered videos that provide perfect 3D structure and motion; PDI scores will be computed both on the original renders and on versions with controlled generator-like perturbations. We will also report results across multiple reconstructors and trackers to confirm that the observed failure modes persist and are attributable to the video generators rather than the analysis pipeline. revision: yes
Circularity Check
No significant circularity in PDI derivation
full rationale
The paper defines PDI-Bench by lifting tracked points from generated video via external monocular reconstruction (MegaSaM, SAM 2, CoTracker3) then computing direct projective-geometry residuals on scale-depth alignment, 3D motion consistency, and structural rigidity. These residuals follow from standard projective constraints applied to the lifted coordinates; no equations, parameters, or self-citations reduce the reported values to quantities fitted on the same evaluation videos. The central claim therefore rests on an independent geometric calculation rather than tautological re-expression of inputs. This is the expected non-circular outcome for a metric constructed from first-principles geometry.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Monocular depth and point-tracking tools yield 3D world coordinates accurate enough for the purpose of measuring geometric inconsistency
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (echoes?)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear?)
UNCLEAR: the relation between this paper passage and the cited Recognition theorem is ambiguous.
h_t · Z_t = f · H = constant
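The quoted relation is the pinhole scale-depth invariant: a rigid object of true height H imaged with focal length f should satisfy h_t · Z_t = f · H in every frame, so the product h_t · Z_t is constant in a geometrically consistent clip. A minimal residual built on this invariant, as a sketch (the coefficient-of-variation form is an assumption, not necessarily the paper's ε_scale):

```python
import numpy as np

def scale_depth_residual(h, Z):
    """Coefficient of variation of h_t * Z_t; ~0 when scale and depth agree."""
    prod = np.asarray(h, dtype=float) * np.asarray(Z, dtype=float)
    return float(prod.std() / (prod.mean() + 1e-8))

f, H = 500.0, 2.0                     # assumed focal length (px) and height (m)
Z = np.linspace(4.0, 8.0, 10)         # object recedes from 4 m to 8 m
h = f * H / Z                         # pinhole projection keeps h * Z = f * H

assert scale_depth_residual(h, Z) < 1e-9   # consistent clip: product constant

# A generator that shrinks the object faster than its depth explains:
h_bad = h * np.linspace(1.0, 0.7, 10)
assert scale_depth_residual(h_bad, Z) > 0.05
```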
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3]
- [4]
-
[5]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. URL https://arxiv.org/abs/2311.15127
-
[6]
Doubao: A family of large language models
ByteDance. Doubao: A family of large language models. https://www.volcengine.com/product/doubao, 2026. Accessed: 2026-05-06
-
[7]
Seedance 2.0 Fast: High-efficiency video generation foundation model
ByteDance. Seedance 2.0 fast: High-efficiency video generation foundation model. https://www.doubao.com/, 2026. Accessed: 2026-04-19
- [8]
-
[9]
H. Duan, H.-X. Yu, S. Chen, L. Fei-Fei, and J. Wu. Worldscore: A unified evaluation benchmark for world generation, 2025. URL https://arxiv.org/abs/2504.00983
-
[10]
Flow: Where the next wave of storytelling happens
Google. Flow: Where the next wave of storytelling happens. https://labs.google/fx/tools/flow, 2026. Accessed: 2026-03-04
-
[11]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models, 2020. URL https://arxiv.org/abs/2006.11239
-
[12]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion, 2025. URL https://arxiv.org/abs/2506.08009
- [13]
-
[14]
Cotracker3: Simpler and better point tracking by pseudo-labelling real videos
N. Karaev, I. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos, 2024. URL https://arxiv.org/abs/2410.11831
-
[15]
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao...
-
[16]
URL https://arxiv.org/abs/2412.03603
- [17]
- [18]
-
[19]
Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, L. He, and L. Sun. Sora: A review on background, technology, limitations, and opportunities of large vision models,
-
[20]
URL https://arxiv.org/abs/2402.17177
- [21]
-
[22]
Sora: Creating video from text
OpenAI. Sora: Creating video from text. https://openai.com/sora, 2025. Accessed: 2026-03-20
-
[23]
Learning Transferable Visual Models From Natural Language Supervision
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020
-
[24]
N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714
-
[25]
Improved Techniques for Training GANs
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans, 2016. URL https://arxiv.org/abs/1606.03498
- [26]
-
[27]
Towards Accurate Generative Models of Video: A New Metric & Challenges
T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges, 2019. URL https://arxiv.org/abs/1812.01717
-
[28]
R. Upadhyay, H. Zhang, J. Solomon, A. Agrawal, P. Boreddy, S. S. Narayana, Y. Ba, A. Wong, C. M. de Melo, and A. Kadambi. Worldbench: Disambiguating physics for diagnostic evaluation of world models, 2026. URL https://arxiv.org/abs/2601.21282
-
[29]
T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....
-
[30]
Wan-Video. Wan2.2: Wan: Open and advanced large-scale video generative models. https://github.com/Wan-Video/Wan2.2, 2025. GitHub repository
-
[31]
B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, and L. Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks, 2023. URL https://arxiv.org/abs/2311.06242
-
[32]
Cogvideox-3: Text-to-video diffusion models
Zhipu AI. Cogvideox-3: Text-to-video diffusion models. https://chatglm.cn/video, 2026. Accessed: 2026-04-18.
A. Additional Experimental Details
A.1. PDI-Dataset Construction
The PDI-Dataset consists of 183 video sequences in total, partitioned into real-world and synthetic subsets. Real-world sequences. The real-world portion of PDI-Dataset contains 15 sh...
-
[33]
All synthetic videos presented in our benchmark reflect the baseline commercial performance available to end-users at the time of evaluation. Note that the Sora samples in our dataset were generated using the $20 monthly consumer subscription rather than the enterprise API, representing the baseline commercial performance of the model. The 28 text prompts...
-
[34]
A handheld following shot of a red vintage car driving away on a straight desert highway, harsh noon light and heat haze on the horizon, subtle shake and lateral drift
-
[35]
A high-speed train moving toward the viewer on a straight track, low-angle handheld perspective, rails and gravel receding toward a clear vanishing point
-
[36]
A yellow school bus driving away on a straight tree-lined suburban street, the shot tracking from a low position behind, morning light and clean asphalt
-
[37]
A silver metallic sphere rolling away on a long reflective marble floor in a bright gallery, the shot following closely with slight sway
-
[38]
A heavy cargo truck moving away on a straight bridge at night, tail lights glowing, subtle frame shake, city lights in the distance
-
[39]
A large shipping container being pushed away on a straight industrial dock, cranes and water behind, moving viewpoint, overcast industrial light.
Dynamic Tracking
-
[40]
A handheld following shot of a red sports car driving on a straight multi-lane highway, city skyline and roadside trees in the background receding rapidly with parallax
-
[41]
A smooth following shot of an autonomous suitcase moving through a vast airport terminal, repeated columns and floor patterns rushing past in frame
-
[42]
A close handheld shot following a large chrome sphere rolling along a straight, reflective museum corridor, exhibits and windows flowing past
-
[43]
A following shot from a vehicle alongside, keeping pace with a large truck carrying a blue container on a long bridge, waves and bridge cables creating dynamic background motion
-
[44]
A smooth following shot of a metal logistics crate moving along a straight automated conveyor, complex factory machinery in the background rushing past
-
[45]
A handheld following shot of a large metal ball rolling through a straight modern art gallery, surrounding artworks and viewers receding rapidly with parallax.
Biological Motion
-
[46]
A smooth following shot of a large eagle flying at high speed parallel to a cliff, rock face and sea below, clear sky
-
[47]
A following shot from a moving boat of a dolphin swimming and leaping in the waves alongside, spray and sunlight
-
[48]
A handheld shot of a large octopus swimming away in a complex coral reef, tentacles waving, colorful fish and coral, blue water and light shafts
-
[49]
A backward-moving shot following a snake slithering through dense colorful flowers on the ground, petals and stems, soft daylight
-
[50]
A moving shot following a peacock walking and shaking its tail feathers in a palace garden, fountains and trimmed hedges, ornate tiles.
Curved Motion
-
[51]
A handheld tracking perspective follows a silver compact SUV navigating a sharp hairpin turn on a winding mountain road. The view orbits slightly to capture the vehicle transitioning from a front-view to a side-view against the pine forest background
-
[52]
A low-angle shot follows a sports car drifting through a 90-degree corner on a professional race track. The car rotates intensely while the moving shot emphasizes the shifting vanishing lines of the curb and tire marks
-
[53]
A cinematic tracking shot follows a city bus driving through a large, ornate stone round- about. The view maintains a side perspective, showing the bus constantly changing its orientation relative to the central fountain and surrounding city traffic
-
[54]
A ground-level perspective tracking a small delivery robot as it makes a sharp turn at a sidewalk corner. The shot stays close, highlighting the rotation of the robot’s boxy frame against the detailed brickwork
-
[55]
A handheld shot follows a green tractor making a wide turn at the edge of a plowed field. The view moves with the vehicle, capturing the shifting angles of the heavy wheels and mechanical parts against the vast landscape.
Partial Occlusion
-
[56]
A car driving along a street at night, wheels briefly obscured by a low roadside guardrail for under a second, handheld shot moving alongside, street lamps and storefronts
-
[57]
A train passing behind a row of thin vertical power line poles, the shot tracking its movement from a moving platform, sky and industrial landscape
-
[58]
A bus moving through a city street, briefly partially hidden by a thin traffic sign, the shot following from the sidewalk
-
[59]
A vintage car driving past a row of thin trees, never fully leaving the moving view, autumn leaves and road
-
[60]
A boat sailing behind a thin pier support, remaining partially visible throughout, handheld shot from the dock, sea and sky.
-
[61]
A robot crate moving through a warehouse, passing behind a thin metal rack, the shot following alongside, shelves and boxes, industrial lighting.
Reconstruction-aware weighting. The final PDI score is synthesized as a weighted sum of three orthogonal physical residuals:
PDI = w_1 · RMSE(ε_scale) + w_2 · RMSE(ε_traj) + w_3 · ε_rigidity,  (11)
where Σ_i w_i = 1. ...
-
[62]
3D Pairwise Rigidity (primary). We sample world-space points q_t^n from MegaSaM pointmaps at CoTracker locations. Anchor pairs are selected at t = 0 by triple filtering: (i) visibility filtering, (ii) depth-gradient reliability filtering, and (iii) pair scoring that favors both large 3D separation and interior-region reliability (distance to the mask boundary). Fo...
-
[63]
3D Height Stability (fallback when Strategy 1 is not entered). If 3D points are valid but Strategy 1 is unavailable at the dispatcher level, we compute the per-frame 3D object height from the foreground y-span, h3D_t = P_95(y_t) − P_5(y_t), and use its coefficient of variation:
ε_rigid^(2) = std({h3D_t}_{t=1..T}) / (mean({h3D_t}_{t=1..T}) + ε).
-
[64]
2D Pairwise Consistency (degraded fallback). When 3D evidence is unavailable, we use 2D CoTracker pairwise distance ratios:
r2D_ij(t) = d_ij(t) / d_ij(0),  ρ2D_t = std({r2D_ij(t)}) / (mean({r2D_ij(t)}) + ε),
and compute ε_rigid^(3) = (1/T) Σ_{t=1..T} ρ2D_t. Finally, the rigidity component used by PDI is ε_rigid = ε_rigid^(1) if Strategy 1 is sel...
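Given the three residuals, the aggregation in Eq. (11) reduces to a weighted sum. A minimal sketch; the equal default weights and the array-versus-scalar conventions are illustrative assumptions, not the benchmark's released configuration:

```python
import numpy as np

def pdi_score(eps_scale, eps_traj, eps_rigid, w=(1/3, 1/3, 1/3)):
    """PDI = w1*RMSE(eps_scale) + w2*RMSE(eps_traj) + w3*eps_rigid, as in Eq. (11).

    eps_scale, eps_traj: per-frame residual arrays; eps_rigid: scalar
    (already a time-averaged coefficient of variation). Weights sum to 1.
    """
    w = np.asarray(w, dtype=float)
    assert np.isclose(w.sum(), 1.0), "weights must sum to one"
    rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
    return w[0] * rmse(eps_scale) + w[1] * rmse(eps_traj) + w[2] * float(eps_rigid)

# A geometrically perfect clip scores zero...
assert pdi_score(np.zeros(10), np.zeros(10), 0.0) == 0.0
# ...and any residual error raises the index.
assert pdi_score(np.full(10, 0.3), np.zeros(10), 0.0) > 0.0
```

Lower scores indicate better geometric coherence under this formulation, with zero attained only when all three residuals vanish.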