GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure

Charles Herrmann; Deqing Sun; Fangneng Zhan; Hanspeter Pfister; Junhwa Hur; Leslie Gu; Todd Zickler

arxiv: 2512.22274 · v2 · submitted 2025-12-25 · 💻 cs.CV

Leslie Gu, Junhwa Hur, Charles Herrmann, Fangneng Zhan, Todd Zickler, Deqing Sun, Hanspeter Pfister

Pith reviewed 2026-05-16 20:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords geometric consistency · video generation · motion residuals · depth priors · occlusion artifacts · deformation detection · AI evaluation metric

The pith

GeCo detects geometric deformation and occlusion inconsistencies in generated videos of static scenes by fusing residual motion and depth priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GeCo, a metric for evaluating geometric consistency in generated video. It targets static scenes and jointly flags two artifact types: object deformation and inconsistent occlusion. By fusing residual motion cues with depth estimates, it produces dense consistency maps that localize these problems. The authors use GeCo to benchmark current video models and find recurring failure modes. They also show it can serve as a guidance loss during generation, reducing deformation artifacts without retraining.

Core claim

GeCo is a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. It is used to benchmark recent video generation models and as a training-free guidance loss to reduce deformation artifacts during video generation.

What carries the argument

The GeCo metric, which combines residual motion and depth priors to generate dense, interpretable consistency maps for identifying geometric and occlusion issues.
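
As a rough illustration of the residual-motion half of this idea: in a static scene, the optical flow between two frames should match the flow induced by camera motion and depth alone, so whatever is left over points to an artifact. The sketch below is not the paper's formula, which this page does not give. The function names, the pinhole intrinsics tuple `(fx, fy, cx, cy)`, and the normalization by flow magnitude are all assumptions made for illustration, and the depth-error term and the fusion step are left out.

```python
import math

# Identity rotation, used in the toy example below.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def rigid_flow_at(u, v, depth, fx, fy, cx, cy, R, t):
    """Flow that a static 3D point seen at pixel (u, v) shows under camera motion (R, t)."""
    # Back-project the pixel to a 3D point in the first camera's frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Move the point into the second camera's frame: p2 = R p1 + t.
    x2 = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    y2 = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    z2 = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    # Re-project and subtract the original pixel position.
    return (fx * x2 / z2 + cx - u, fy * y2 / z2 + cy - v)


def residual_motion(flow_obs, depth, intrinsics, R, t, eps=1e-6):
    """Per-pixel gap between observed flow and camera-induced flow.

    Dividing by the summed flow magnitudes is an assumed normalization
    (hypothetical, chosen only to make the value scale-free).
    """
    fx, fy, cx, cy = intrinsics
    out = []
    for v, row in enumerate(flow_obs):
        out_row = []
        for u, (du, dv) in enumerate(row):
            ru, rv = rigid_flow_at(u, v, depth[v][u], fx, fy, cx, cy, R, t)
            diff = math.hypot(du - ru, dv - rv)
            scale = math.hypot(du, dv) + math.hypot(ru, rv) + eps
            out_row.append(diff / scale)
        out.append(out_row)
    return out


# Toy 4x4 scene: a fronto-parallel plane at depth 2, camera shifted 0.1 along x.
intr = (100.0, 100.0, 2.0, 2.0)
shift = (0.1, 0.0, 0.0)
depth_map = [[2.0] * 4 for _ in range(4)]
flow = [[(5.0, 0.0)] * 4 for _ in range(4)]   # matches the camera-induced flow
flow[2][3] = (8.0, 0.0)                       # one pixel moves "too much"
residual = residual_motion(flow, depth_map, intr, IDENTITY, shift)
```

In this toy setup only the pixel at row 2, column 3 gets a clearly nonzero residual, which is the behavior a dense consistency map relies on.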

If this is right

Systematic benchmarking of video generation models reveals common geometric failure modes.
GeCo can serve as a training-free loss to guide generation and reduce deformation artifacts.
Dense consistency maps provide interpretable visualizations of where inconsistencies occur in generated videos.
Joint detection of deformation and occlusion errors allows for more comprehensive evaluation than separate metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

GeCo's approach might be adapted to evaluate consistency in other generative tasks like image synthesis or 3D reconstruction.
The metric could help in developing more robust video generation models by providing direct feedback on structural fidelity.
Extending GeCo to handle dynamic scenes with moving objects would broaden its applicability to real-world video content.

Load-bearing premise

That residual motion combined with depth priors is sufficient to reliably separate true geometric deformation and occlusion errors from other sources of inconsistency in generated videos of static scenes.

What would settle it

A video generation output of a static scene with visible geometric warping or wrong occlusions that nonetheless receives high consistency scores from GeCo would challenge the metric's effectiveness.

Figures

Figures reproduced from arXiv: 2512.22274 by Charles Herrmann, Deqing Sun, Fangneng Zhan, Hanspeter Pfister, Junhwa Hur, Leslie Gu, Todd Zickler.

**Figure 1.** Geometric deformation detection on a generated video. Top: Input frames; the white box marks the target frame for detection. Middle: Zoomed-in deformations. Red box: the front chess piece (indicated by the arrow) gradually moves toward the piece behind it until they merge into a single piece, with the merged region highlighted by a red dashed circle. Blue box: a bishop morphs into a queen. Bottom: Comparis…

**Figure 2.** GeCo pipeline. Within a sliding window, we jointly estimate dense optical flow and 3D geometry (depth and camera pose) for frame pairs. We compute residual motion and depth errors and fuse them into scale-invariant inconsistency maps. Aggregation over the window localizes artifacts in the target frame, while motion and structure maps provide complementary diagnostics.

**Figure 3.** WarpBench deformation process. (Left) Input frame with foreground segmentation mask (cyan), sampled thin-plate spline (TPS) control points (red), and their destination points (blue). (Middle) Warped frame after the TPS deformation. (Right) Ground-truth dense displacement field from the deformation.

**Figure 5.** GeCo guidance improves geometric consistency for 3D reconstruction. Top: 3D reconstructions from videos generated by CogVideoX-5B without (left) and with GeCo guidance (right). Bottom: corresponding video frames. Both guided videos yield cleaner geometry with fewer deformation and drifting artifacts across views, which enables higher reconstruction quality.

**Figure 6.** “The Globe That Can’t Be Stopped.” A common failure mode: models consistently make the globe rotate despite static prompts. GeCo localizes this spurious object motion on the globe surface, clearly separating it from the intended egocentric camera motion.

**Figure 7.** Stopping the globe and freezing the dog with GeCo guidance. We compare video generations for prompts specifying a camera orbiting a nominally static globe (top rows) and a static dog (bottom rows). Without guidance, the model introduces spurious object motion, causing the globe to spin and the dog to move. Applying GeCo guidance effectively suppresses this non-ego motion, enforcing geometric consistency …

Original abstract

We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GeCo, a geometry-grounded metric that fuses residual motion (optical flow) with monocular depth priors to produce dense consistency maps for jointly detecting geometric deformation and occlusion-inconsistency artifacts in generated videos of static scenes. It applies the metric to benchmark recent video generation models, identify common failure modes, and serve as a training-free guidance loss to mitigate deformation artifacts during synthesis.

Significance. If the depth priors remain sufficiently accurate on artifact-laden generated frames, GeCo could provide a useful, interpretable tool for structural evaluation and guidance in video synthesis, addressing limitations in existing perceptual metrics. The training-free guidance application is a practical strength that could be directly adopted by practitioners.

major comments (2)

[Abstract / Method] Abstract and method description: the central claim that fusing residual motion with depth priors reliably separates true geometric deformation and occlusion errors from other inconsistency sources (e.g., texture flicker or lighting) is load-bearing for both the benchmarking and guidance results, yet no validation or ablation demonstrates that monocular depth estimates remain accurate rather than hallucinating or smoothing over the very artifacts GeCo targets.
[Experiments] Experiments section: the reported benchmarking of video models and quantitative gains from the guidance loss depend on GeCo's maps being faithful; without error analysis on depth network outputs for deformed frames or comparison against ground-truth depth where available, the improvements cannot be confidently attributed to the metric rather than post-hoc choices.

minor comments (2)

[Method] Clarify the precise fusion formula (e.g., how residual flow and depth are combined into the consistency map) with an explicit equation to improve reproducibility.
[Abstract / Experiments] The assumption of 'static scenes' is stated but not operationalized; specify how camera motion or object motion is excluded or handled in the benchmark datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit validation of depth prior robustness. We address each major comment below and will incorporate the suggested analyses in the revised manuscript to strengthen the claims.

Referee: [Abstract / Method] Abstract and method description: the central claim that fusing residual motion with depth priors reliably separates true geometric deformation and occlusion errors from other inconsistency sources (e.g., texture flicker or lighting) is load-bearing for both the benchmarking and guidance results, yet no validation or ablation demonstrates that monocular depth estimates remain accurate rather than hallucinating or smoothing over the very artifacts GeCo targets.

Authors: We agree that the separation claim is central and that direct validation of depth accuracy on artifact-laden frames was not provided. The fusion is motivated by the observation that residual motion captures local inconsistencies while depth provides global structure, but we acknowledge the absence of targeted ablations. In revision, we will add a dedicated analysis subsection with (i) qualitative depth map comparisons on clean vs. deformed generated frames and (ii) quantitative error metrics on synthetic data with controlled geometric artifacts to demonstrate that monocular estimates do not systematically hallucinate or smooth the targeted inconsistencies. revision: yes
Referee: [Experiments] Experiments section: the reported benchmarking of video models and quantitative gains from the guidance loss depend on GeCo's maps being faithful; without error analysis on depth network outputs for deformed frames or comparison against ground-truth depth where available, the improvements cannot be confidently attributed to the metric rather than post-hoc choices.

Authors: We concur that faithful attribution of benchmarking results and guidance gains requires evidence that GeCo maps reflect true geometric errors. The current experiments rely on the metric's design and qualitative visualizations, but lack the requested error analysis. We will revise the experiments section to include (i) depth network error statistics on frames with documented deformations and (ii) comparisons against ground-truth depth on available synthetic video subsets, allowing readers to assess whether the reported improvements are driven by accurate inconsistency detection. revision: yes
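
For the depth-error statistics promised in the rebuttal, one widely used measure is absolute relative error (AbsRel): the mean of |d_pred - d_ref| / d_ref over valid pixels. The rebuttal does not say which metric the authors will report, so AbsRel is an assumption here, as are the function name, the optional `mask` argument, and the toy values. Evaluating the error separately inside and outside a known artifact region is one way to check whether the depth prior degrades exactly where deformation occurs.

```python
def abs_rel(pred, ref, mask=None, min_depth=1e-6):
    """Mean absolute relative depth error over valid (and optionally masked) pixels.

    pred, ref: 2D lists of depths; mask: optional 2D list of booleans.
    All names are illustrative and do not come from the paper.
    """
    total, count = 0.0, 0
    for y, (pred_row, ref_row) in enumerate(zip(pred, ref)):
        for x, (p, r) in enumerate(zip(pred_row, ref_row)):
            if r <= min_depth:
                continue  # skip pixels without a usable reference depth
            if mask is not None and not mask[y][x]:
                continue  # skip pixels outside the region of interest
            total += abs(p - r) / r
            count += 1
    if count == 0:
        raise ValueError("no valid pixels to evaluate")
    return total / count


# Toy 2x2 example: the prediction is wrong only at the one "deformed" pixel.
ref_depth = [[2.0, 2.0], [4.0, 4.0]]
pred_depth = [[2.0, 3.0], [4.0, 4.0]]
artifact = [[False, True], [False, False]]
clean = [[not m for m in row] for row in artifact]

overall = abs_rel(pred_depth, ref_depth)
inside = abs_rel(pred_depth, ref_depth, mask=artifact)
outside = abs_rel(pred_depth, ref_depth, mask=clean)
```

A large gap between `inside` and `outside` on controlled synthetic data would suggest the depth estimate is unreliable in artifact regions, which is the failure the referee is worried about.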

Circularity Check

0 steps flagged

No circularity: GeCo metric is constructed from external priors

The paper defines GeCo explicitly as the fusion of residual motion and depth priors to generate consistency maps for detecting artifacts in static scenes. No equations, self-citations, or fitted parameters are presented that reduce the metric definition to its own outputs or predictions. The construction is presented as a direct combination of independent external signals (optical flow residuals and monocular depth), with no load-bearing step that renames a fit or imports uniqueness from prior author work. This is the common case of a self-contained metric definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view; no explicit free parameters, axioms, or invented entities are stated. The metric is described as fusing existing motion residuals and depth priors, which are treated as given inputs.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
cs.CV 2026-05 unverdicted novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575, 2025.
[2] Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. MEt3R: Measuring multi-view consistency in generated images. In CVPR, pages 6034–6044, 2025.
[3] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In CVPRW, pages 843–852, 2023.
[4] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, … OpenAI technical report.
[7] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021.
[8] Subhabrata Choudhury, Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Guess what moves: Unsupervised video and image segmentation by anticipating motion. In BMVC, 2022.
[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 337–33712, 2017.
[11] Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. WorldScore: A unified evaluation benchmark for world generation. In ICCV, 2025.
[12] Daniel Geng and Andrew Owens. Motion guidance: Diffusion-based image editing with differentiable motion estimators. In ICLR, 2024.
[13] Google DeepMind. Veo: a text-to-video generation system. Technical report, 2025. Veo 3 technical report.
[14] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[15] Xinran Han, Todd Zickler, and Ko Nishino. Multistable shape from shading emerges from patch diffusion. NeurIPS, 37:34686–34711, 2024.
[16] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, …
[17] Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon. Manifold preserving guided diffusion. In ICLR, …
[18] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
[19] Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, and Duygu Ceylan. JOG3R: Towards 3D-consistent video generators. In BMVC, 2025.
[20] Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. In CVPR, pages 3406–3416, …
[21] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. CVPR, pages 21807–21818, 2023.
[22] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Yingcong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. ArXiv, 2024.
[23] Sang-Sub Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe L. Lin, and Sung Ju Hwang. Frame guidance: Training-free guidance for frame-level control in video diffusion models. ArXiv, abs/2506.07177, 2025.
[24] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, …
[25] Hala Lamdouar, Weidi Xie, and Andrew Zisserman. Segmenting invisible moving objects. BMVC, 2021.
[26] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17581–17592, 2023.
[27] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond H. Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. CVPR, pages 22139–22149, 2023.
[28] David G. Lowe. Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, 2:1150–1157 vol.2, …
[29] Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. CVPR, pages 11451–11461, 2022.
[30] Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation. CVPR, pages 7837–7846, 2024.
[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
[32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:...
[33] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
[34] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3C: 3D-informed world-consistent video generation with precise camera control. In CVPR, pages 6121–6132, 2025.
[35] Paul D. Sampson. Fitting conic sections to “very scattered” data: An iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing, 1982.
[36] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. NeurIPS, 34:16558–16569, 2021.
[37] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[38] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.
[39] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, pages 20697–20709, 2024.
[40] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. In ICLR, 2023.
[41] Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. NeurIPS, 35:28023–28036, 2022.
[42] Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is SAM (and flow). In ACCV, pages 162–178, 2024.
[43] Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, and Weidi Xie. Self-supervised video object segmentation by motion grouping. In ICCV, pages 7177–7188, 2021.
[44] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. NeurIPS, 37:21875–21911, 2024.
[45] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.
[46] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023.
[47] Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, and Marcus A. Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. In ICCV, 2023.
[48] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.
[49] Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista Martin, Kevin Miao, Alexander Toshev, Joshua Susskind, and Jiatao Gu. World-consistent video diffusion with explicit 3D modeling. In CVPR, pages 21685–21695, 2025.
[50] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation. In ICLR, …
[51] Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, Sebastian Scherer, and Wenshan Wang. UFM: A simple path towards unified dense correspondence with flow. In arXiv, 2025.

Supplementary material excerpt

From "GeCo: A Differentiable Geometric Consistency Metric for Video Generation, Supplementary Material" (extracted as entries [52] to [54]):

Warm-up (t ∈ [0, 2]): We perform no gradient updates (R_t = 0) in the initial steps to establish the global layout.
Strong Guidance (t ∈ [3, 19]): We apply R_t = 3 updates per step to enforce strong geometric constraints during the formation of structural content.
Refinement (t ∈ [20, 49]): We reduce the frequency to R_t = 2 updates per step to maintain consistency without disrupting fine texture generation.

To mitigate the accumulation of errors and prevent the latent from drifting off the data manifold during aggressive updates, we strictly employ Time-Travel [16] within the specific interval t ∈ [15, 20].

D. Details on...
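
The three-phase update schedule quoted from the supplementary material (warm-up, strong guidance, refinement) can be written down as a small lookup. This is a sketch under stated assumptions: `t` is treated as a 0-based step index over 50 sampling steps, and the function names are hypothetical. What one guidance update actually does (the loss and the latent update rule) is not described on this page and is not modeled here.

```python
def updates_per_step(t):
    """Number of guidance gradient updates R_t at step t, per the quoted schedule."""
    if 0 <= t <= 2:
        return 0   # warm-up: no updates while the global layout forms
    if 3 <= t <= 19:
        return 3   # strong guidance while structural content forms
    if 20 <= t <= 49:
        return 2   # refinement: fewer updates to protect fine texture
    raise ValueError(f"step {t} is outside the 50-step schedule")


def uses_time_travel(t):
    """Whether Time-Travel is applied at step t (quoted interval: t in [15, 20])."""
    return 15 <= t <= 20


# Summary over the whole schedule (step indexing is an assumption, see above).
schedule = [updates_per_step(t) for t in range(50)]
total_updates = sum(schedule)
time_travel_steps = [t for t in range(50) if uses_time_travel(t)]
```

Under these assumptions the schedule adds 111 guidance updates in total: 17 steps with 3 updates and 30 steps with 2 updates.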

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foun- dation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

MEt3R: Measuring multi-view consistency in generated images

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. MEt3R: Measuring multi-view consistency in generated images. InCVPR, pages 6034–6044, 2025. 1, 2, 3, 6, 7, 8

work page 2025

[3] [3]

Universal guidance for diffusion models

Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geip- ing, and Tom Goldstein. Universal guidance for diffusion models. InCVPRW, pages 843–852, 2023. 3, 5

work page 2023

[4] [4]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M ¨uller. ZoeDepth: Zero-shot trans- fer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Video generation models as world simulators,

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luh- man, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators,

work page

[6] [6]

OpenAI technical report. 6

work page

[7] [7]

Emerg- ing properties in self-supervised vision transformers.2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv’e J’egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers.2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021. 2, 3

work page 2021

[8] [8]

Guess what moves: Unsupervised video and image segmentation by anticipating motion

Subhabrata Choudhury, Laurynas Karazija, Iro Laina, An- drea Vedaldi, and Christian Rupprecht. Guess what moves: Unsupervised video and image segmentation by anticipating motion. InBMVC, 2022. 3

work page 2022

[9] [9]

Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017. 5, 3

work page 2017

[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 337–33712, 2017.

[11] Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. WorldScore: A unified evaluation benchmark for world generation. In ICCV, 2025.

[12] Daniel Geng and Andrew Owens. Motion guidance: Diffusion-based image editing with differentiable motion estimators. In ICLR, 2024.

[13] Google DeepMind. Veo: a text-to-video generation system. Technical report, 2025. Veo 3 technical report.

[14] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[15] Xinran Han, Todd Zickler, and Ko Nishino. Multistable shape from shading emerges from patch diffusion. NeurIPS, 37:34686–34711, 2024.

[16] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press,

[17] Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon. Manifold preserving guided diffusion. In ICLR,

[18] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.

[19] Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, and Duygu Ceylan. JOG3R: Towards 3D-consistent video generators. In BMVC, 2025.

[20] Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. In CVPR, pages 3406–3416,

[21] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. CVPR, pages 21807–21818, 2023.

[22] Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Yingcong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. arXiv, 2024.

[23] Sang-Sub Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe L. Lin, and Sung Ju Hwang. Frame guidance: Training-free guidance for frame-level control in video diffusion models. arXiv, abs/2506.07177, 2025.

[24] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603,

[25] Hala Lamdouar, Weidi Xie, and Andrew Zisserman. Segmenting invisible moving objects. BMVC, 2021.

[26] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17581–17592, 2023.

[27] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond H. Chan, and Ying Shan. EvalCrafter: Benchmarking and evaluating large video generation models. CVPR, pages 22139–22149, 2023.

[28] David G. Lowe. Object recognition from local scale-invariant features. Proceedings of the Seventh IEEE International Conference on Computer Vision, 2:1150–1157 vol.2,

[29] Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. CVPR, pages 11451–11461, 2022.

[30] Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation. CVPR, pages 7837–7846, 2024.

[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.

[32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:...

[33] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.

[34] Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3C: 3D-informed world-consistent video generation with precise camera control. In CVPR, pages 6121–6132, 2025.

[35] Paul D. Sampson. Fitting conic sections to “very scattered” data: An iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing, 1982.

[36] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. NeurIPS, 34:16558–16569, 2021.

[37] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[38] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.

[39] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, pages 20697–20709, 2024.

[40] Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. In ICLR, 2023.

[41] Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. NeurIPS, 35:28023–28036, 2022.

[42] Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is SAM (and flow). In ACCV, pages 162–178, 2024.

[43] Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, and Weidi Xie. Self-supervised video object segmentation by motion grouping. In ICCV, pages 7177–7188, 2021.

[44] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything V2. NeurIPS, 37:21875–21911, 2024.

[45] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[46] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023.

[47] Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, and Marcus A. Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. In ICCV, 2023.

[48] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023.

[49] Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista Martin, Kevin Miao, Alexander Toshev, Joshua Susskind, and Jiatao Gu. World-consistent video diffusion with explicit 3D modeling. In CVPR, pages 21685–21695, 2025.

[50] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation. In ICLR,

[51] Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, Sebastian Scherer, and Wenshan Wang. UFM: A simple path towards unified dense correspondence with flow. In arXiv, 2025.

GeCo: A Differentiable Geometric Consistency Metric for Video Generation Supplementary Material Fig...

Warm-up (t ∈ [0, 2]): We perform no gradient updates (R_t = 0) in the initial steps to establish the global layout.

Strong Guidance (t ∈ [3, 19]): We apply R_t = 3 updates per step to enforce strong geometric constraints during the formation of structural content.

Refinement (t ∈ [20, 49]): We reduce the frequency to R_t = 2 updates per step to maintain consistency without disrupting fine texture generation.

To mitigate the accumulation of errors and prevent the latent from drifting off the data manifold during aggressive updates, we strictly employ Time-Travel [16] within the specific interval t ∈ [15, 20].

D. Details on...
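As a rough illustration of the three-phase guidance schedule described above (warm-up, strong guidance, refinement), the per-step update count R_t and the Time-Travel window can be written as a small lookup. This is only a sketch under stated assumptions: the function names `updates_per_step` and `uses_time_travel` and the constant `NUM_STEPS` are hypothetical, the step index t is assumed to run from 0 to 49 in sampling order, and all intervals are treated as inclusive. It does not reproduce the guidance loss or the Time-Travel procedure itself.

```python
NUM_STEPS = 50  # assumption: sampling steps are indexed t = 0..49


def updates_per_step(t: int) -> int:
    """Return R_t, the number of guidance gradient updates at step t."""
    if not 0 <= t < NUM_STEPS:
        raise ValueError(f"step index {t} is outside [0, {NUM_STEPS - 1}]")
    if t <= 2:       # warm-up, t in [0, 2]: no gradient updates
        return 0
    if t <= 19:      # strong guidance, t in [3, 19]
        return 3
    return 2         # refinement, t in [20, 49]


def uses_time_travel(t: int) -> bool:
    """Return True if step t lies in the Time-Travel interval [15, 20]."""
    return 15 <= t <= 20


# One (step, R_t, time-travel flag) triple per sampling step.
schedule = [(t, updates_per_step(t), uses_time_travel(t)) for t in range(NUM_STEPS)]
total_updates = sum(r for _, r, _ in schedule)
print(total_updates)  # 17 steps * 3 + 30 steps * 2 = 111
```

Under these assumptions the schedule amounts to 111 guidance updates over the 50 steps, with Time-Travel active on six of them (t = 15 through 20), spanning the end of the strong-guidance phase and the first refinement step.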
