GeCo: Evaluating Geometric Consistency for Video Generation via Motion and Structure
Pith reviewed 2026-05-16 20:05 UTC · model grok-4.3
The pith
GeCo detects geometric deformation and occlusion inconsistencies in generated videos of static scenes by fusing residual motion and depth priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeCo is a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. It is used to benchmark recent video generation models and as a training-free guidance loss to reduce deformation artifacts during video generation.
What carries the argument
The GeCo metric, which combines residual motion and depth priors to generate dense, interpretable consistency maps for identifying geometric and occlusion issues.
If this is right
- Systematic benchmarking of video generation models reveals common geometric failure modes.
- GeCo can serve as a training-free loss to guide generation and reduce deformation artifacts.
- Dense consistency maps provide interpretable visualizations of where inconsistencies occur in generated videos.
- Joint detection of deformation and occlusion errors allows for more comprehensive evaluation than separate metrics.
Where Pith is reading between the lines
- GeCo's approach might be adapted to evaluate consistency in other generative tasks like image synthesis or 3D reconstruction.
- The metric could help in developing more robust video generation models by providing direct feedback on structural fidelity.
- Extending GeCo to handle dynamic scenes with moving objects would broaden its applicability to real-world video content.
Load-bearing premise
That residual motion combined with depth priors is sufficient to reliably separate true geometric deformation and occlusion errors from other sources of inconsistency in generated videos of static scenes.
What would settle it
A video generation output of a static scene with visible geometric warping or wrong occlusions that nonetheless receives high consistency scores from GeCo would challenge the metric's effectiveness.
Figures
read the original abstract
We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GeCo, a geometry-grounded metric that fuses residual motion (optical flow) with monocular depth priors to produce dense consistency maps for jointly detecting geometric deformation and occlusion-inconsistency artifacts in generated videos of static scenes. It applies the metric to benchmark recent video generation models, identify common failure modes, and serve as a training-free guidance loss to mitigate deformation artifacts during synthesis.
Significance. If the depth priors remain sufficiently accurate on artifact-laden generated frames, GeCo could provide a useful, interpretable tool for structural evaluation and guidance in video synthesis, addressing limitations in existing perceptual metrics. The training-free guidance application is a practical strength that could be directly adopted by practitioners.
major comments (2)
- [Abstract / Method] Abstract and method description: the central claim that fusing residual motion with depth priors reliably separates true geometric deformation and occlusion errors from other inconsistency sources (e.g., texture flicker or lighting) is load-bearing for both the benchmarking and guidance results, yet no validation or ablation demonstrates that monocular depth estimates remain accurate rather than hallucinating or smoothing over the very artifacts GeCo targets.
- [Experiments] Experiments section: the reported benchmarking of video models and quantitative gains from the guidance loss depend on GeCo's maps being faithful; without error analysis on depth network outputs for deformed frames or comparison against ground-truth depth where available, the improvements cannot be confidently attributed to the metric rather than post-hoc choices.
minor comments (2)
- [Method] Clarify the precise fusion formula (e.g., how residual flow and depth are combined into the consistency map) with an explicit equation to improve reproducibility.
- [Abstract / Experiments] The assumption of 'static scenes' is stated but not operationalized; specify how camera motion or object motion is excluded or handled in the benchmark datasets.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for explicit validation of depth prior robustness. We address each major comment below and will incorporate the suggested analyses in the revised manuscript to strengthen the claims.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: the central claim that fusing residual motion with depth priors reliably separates true geometric deformation and occlusion errors from other inconsistency sources (e.g., texture flicker or lighting) is load-bearing for both the benchmarking and guidance results, yet no validation or ablation demonstrates that monocular depth estimates remain accurate rather than hallucinating or smoothing over the very artifacts GeCo targets.
Authors: We agree that the separation claim is central and that direct validation of depth accuracy on artifact-laden frames was not provided. The fusion is motivated by the observation that residual motion captures local inconsistencies while depth provides global structure, but we acknowledge the absence of targeted ablations. In revision, we will add a dedicated analysis subsection with (i) qualitative depth map comparisons on clean vs. deformed generated frames and (ii) quantitative error metrics on synthetic data with controlled geometric artifacts to demonstrate that monocular estimates do not systematically hallucinate or smooth the targeted inconsistencies. revision: yes
-
Referee: [Experiments] Experiments section: the reported benchmarking of video models and quantitative gains from the guidance loss depend on GeCo's maps being faithful; without error analysis on depth network outputs for deformed frames or comparison against ground-truth depth where available, the improvements cannot be confidently attributed to the metric rather than post-hoc choices.
Authors: We concur that faithful attribution of benchmarking results and guidance gains requires evidence that GeCo maps reflect true geometric errors. The current experiments rely on the metric's design and qualitative visualizations, but lack the requested error analysis. We will revise the experiments section to include (i) depth network error statistics on frames with documented deformations and (ii) comparisons against ground-truth depth on available synthetic video subsets, allowing readers to assess whether the reported improvements are driven by accurate inconsistency detection. revision: yes
Circularity Check
No circularity: GeCo metric is constructed from external priors
full rationale
The paper defines GeCo explicitly as the fusion of residual motion and depth priors to generate consistency maps for detecting artifacts in static scenes. No equations, self-citations, or fitted parameters are presented that reduce the metric definition to its own outputs or predictions. The construction is presented as a direct combination of independent external signals (optical flow residuals and monocular depth), with no load-bearing step that renames a fit or imports uniqueness from prior author work. This is the common case of a self-contained metric definition.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 2 Pith papers
-
Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
-
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
Reference graph
Works this paper leans on
-
[1]
Cosmos World Foundation Model Platform for Physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foun- dation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
MEt3R: Measuring multi-view consistency in generated images
Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. MEt3R: Measuring multi-view consistency in generated images. InCVPR, pages 6034–6044, 2025. 1, 2, 3, 6, 7, 8
work page 2025
-
[3]
Universal guidance for diffusion models
Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geip- ing, and Tom Goldstein. Universal guidance for diffusion models. InCVPRW, pages 843–852, 2023. 3, 5
work page 2023
-
[4]
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M ¨uller. ZoeDepth: Zero-shot trans- fer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Video generation models as world simulators,
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luh- man, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators,
-
[6]
OpenAI technical report. 6
-
[7]
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv’e J’egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers.2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021. 2, 3
work page 2021
-
[8]
Guess what moves: Unsupervised video and image segmentation by anticipating motion
Subhabrata Choudhury, Laurynas Karazija, Iro Laina, An- drea Vedaldi, and Christian Rupprecht. Guess what moves: Unsupervised video and image segmentation by anticipating motion. InBMVC, 2022. 3
work page 2022
-
[9]
Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017. 5, 3
work page 2017
-
[10]
Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description.2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 337–33712, 2017. 3
work page 2018
-
[11]
WorldScore: A unified evaluation benchmark for world generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Ji- ajun Wu. WorldScore: A unified evaluation benchmark for world generation. InICCV, 2025. 1, 2, 3
work page 2025
-
[12]
Motion guidance: Diffusion-based image editing with differentiable motion es- timators
Daniel Geng and Andrew Owens. Motion guidance: Diffusion-based image editing with differentiable motion es- timators. InICLR, 2024. 3
work page 2024
-
[13]
Veo: a text-to-video generation system
Google DeepMind. Veo: a text-to-video generation system. Technical report, 2025. Veo 3 technical report. 6
work page 2025
-
[14]
LTX-Video: Realtime Video Latent Diffusion
Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weiss- buch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Multistable shape from shading emerges from patch diffusion.NeurIPS, 37:34686–34711, 2024
Xinran Han, Todd Zickler, and Ko Nishino. Multistable shape from shading emerges from patch diffusion.NeurIPS, 37:34686–34711, 2024. 3
work page 2024
-
[16]
Richard Hartley and Andrew Zisserman.Multiple View Ge- ometry in Computer Vision. Cambridge University Press,
-
[17]
Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon
Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon. Manifold preserving guided diffusion. InICLR,
-
[18]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022. 6
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
JOG3R: Towards 3D- consistent video generators
Chun-Hao Paul Huang, Niloy Mitra, Hyeonho Jeong, Jae Shin Yoon, and Duygu Ceylan. JOG3R: Towards 3D- consistent video generators. InBMVC, 2025. 3
work page 2025
-
[20]
Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. InCVPR, pages 3406–3416,
-
[21]
VBench: Com- prehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive benchmark suite for video generative models. CVPR, pages 21807–21818, 2023. 2, 3, 6
work page 2023
-
[22]
VBench++: Comprehensive and versatile benchmark suite for video generative models.ArXiv, 2024
Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Yingcong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models.ArXiv, 2024. 2, 3, 6
work page 2024
-
[23]
Frame guidance: Training-free guidance for frame-level control in video diffusion models
Sang-Sub Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe L. Lin, and Sung Ju Hwang. Frame guid- ance: Training-free guidance for frame-level control in video diffusion models.ArXiv, abs/2506.07177, 2025. 3, 5
-
[24]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Seg- menting invisible moving objects.BMVC, 2021
Hala Lamdouar, Weidi Xie, and Andrew Zisserman. Seg- menting invisible moving objects.BMVC, 2021. 3
work page 2021
-
[26]
Lightglue: Local feature matching at light speed
Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Polle- feys. Lightglue: Local feature matching at light speed. 2023 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 17581–17592, 2023. 3
work page 2023
-
[27]
Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond H. Chan, and Ying Shan. EvalCrafter: Benchmarking and eval- uating large video generation models.CVPR, pages 22139– 22149, 2023. 5
work page 2023
-
[28]
David G. Lowe. Object recognition from local scale- invariant features.Proceedings of the Seventh IEEE Interna- tional Conference on Computer Vision, 2:1150–1157 vol.2,
-
[29]
RePaint: Inpainting using denoising diffusion probabilistic models.CVPR, pages 11451–11461, 2022
Andreas Lugmayr, Martin Danelljan, Andr´es Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models.CVPR, pages 11451–11461, 2022. 5
work page 2022
-
[30]
Optical-flow guided prompt optimization for coherent video generation.CVPR, pages 7837–7846, 2024
Hyelin Nam, Jaemin Kim, Dohun Lee, and Jong Chul Ye. Optical-flow guided prompt optimization for coherent video generation.CVPR, pages 7837–7846, 2024. 3
work page 2024
-
[31]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, 2021. 2, 3
work page 2021
-
[32]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Com- mon objects in 3D: Large-scale learning and evaluation of real-life 3d category reconstruction
Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3D: Large-scale learning and evaluation of real-life 3d category reconstruction. InICCV, 2021. 5, 3
work page 2021
-
[34]
Gen3C: 3D-informed world-consistent video generation with precise camera con- trol
Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas M ¨uller, Alexan- der Keller, Sanja Fidler, and Jun Gao. Gen3C: 3D-informed world-consistent video generation with precise camera con- trol. InCVPR, pages 6121–6132, 2025. 3
work page 2025
-
[35]
Paul D. Sampson. Fitting conic sections to “very scattered” data: An iterative refinement of the bookstein algorithm. Computer graphics and image processing, 1982. 2, 3
work page 1982
-
[36]
DROID-SLAM: Deep visual slam for monocular, stereo, and RGB-D cameras.NeurIPS, 34:16558–16569, 2021
Zachary Teed and Jia Deng. DROID-SLAM: Deep visual slam for monocular, stereo, and RGB-D cameras.NeurIPS, 34:16558–16569, 2021. 3
work page 2021
-
[37]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
VGGT: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InCVPR, 2025. 3, 4, 5
work page 2025
-
[39]
DUSt3R: Geometric 3D vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InCVPR, pages 20697–20709, 2024. 3
work page 2024
-
[40]
Zero-shot image restoration using denoising diffusion null-space model
Yinhuai Wang, Jiwen Yu, and Jian Zhang. Zero-shot image restoration using denoising diffusion null-space model. In ICLR, 2023. 5
work page 2023
-
[41]
Segmenting moving objects via an object-centric layered representation
Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. NeurIPS, 35:28023–28036, 2022. 3
work page 2022
-
[42]
Moving object segmentation: All you need is sam (and flow)
Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is sam (and flow). InACCV, pages 162–178, 2024
work page 2024
-
[43]
Self-supervised video object segmentation by motion grouping
Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, and Weidi Xie. Self-supervised video object segmentation by motion grouping. InICCV, pages 7177–7188, 2021. 3
work page 2021
-
[44]
Depth any- thing V2.NeurIPS, 37:21875–21911, 2024
Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing V2.NeurIPS, 37:21875–21911, 2024. 3
work page 2024
-
[45]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
ScanNet++: A high-fidelity dataset of 3D indoor scenes
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. InICCV, 2023. 5, 3
work page 2023
-
[47]
Yu, Fereshteh Forghani, Konstantinos G
Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, and Marcus A. Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. InICCV, 2023. 2, 3
work page 2023
-
[48]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023. 3
work page 2023
-
[49]
World-consistent video diffusion with explicit 3D modeling
Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista Mar- tin, Kevin Miao, Alexander Toshev, Joshua Susskind, and Jiatao Gu. World-consistent video diffusion with explicit 3D modeling. InCVPR, pages 21685–21695, 2025. 3
work page 2025
-
[50]
ControlVideo: Training-free controllable text-to-video generation
Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, XIAOPENG ZHANG, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free controllable text-to-video generation. InICLR,
-
[51]
UFM: A simple path towards unified dense correspondence with flow
Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, Sebastian Scherer, and Wenshan Wang. UFM: A simple path towards unified dense correspondence with flow. InarXiV, 2025. 3, 4, 5 GeCo: A Differentiable Geometric Consistency Metric for Video Generation Supplementary Material Fig...
work page 2025
-
[52]
Warm-up (t∈[0,2]): We perform no gradient updates (Rt = 0) in the initial steps to establish the global layout
-
[53]
Strong Guidance (t∈[3,19]): We applyR t = 3updates per step to enforce strong geometric constraints during the formation of structural content
-
[54]
Refinement (t∈[20,49]): We reduce the frequency to Rt = 2updates per step to maintain consistency without disrupting fine texture generation. To mitigate the accumulation of errors and prevent the latent from drifting off the data manifold during aggres- sive updates, we strictly employ Time-Travel [16] within the specific intervalt∈[15,20]. D. Details on...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.