Video Generation with Predictive Latents
Pith reviewed 2026-05-09 16:52 UTC · model grok-4.3
The pith
A video VAE trained to predict future frames from partial observations produces latents that generate higher-quality videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Predictive Video VAE randomly discards future frames, encodes only the remaining past frames, and trains its decoder to reconstruct the observed frames while simultaneously predicting the missing future ones. This produces a latent space with improved temporal coherence that supports superior video generation: 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101.
What carries the argument
The predictive reconstruction objective that unifies reconstruction of observed frames with prediction of future frames from partial past inputs.
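The shape of this objective can be sketched in a few lines. The following is a toy illustration, not the paper's code: a linear encoder/decoder stands in for the video VAE, and zero-padded latent slots merely mark the masked future positions (in the real model the decoder predicts them from past context). All names (`predictive_loss`, `W_enc`, `W_dec`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear encoder/decoder standing in for the video VAE.
D, L = 8, 4                              # frame dim, latent dim
W_enc = rng.normal(size=(D, L)) * 0.1
W_dec = rng.normal(size=(L, D)) * 0.1

def predictive_loss(frames, k):
    """frames: (T, D). Encode only the first k frames, decode all T,
    and penalize reconstruction of the observed frames plus prediction
    of the masked future ones, as in the unified objective."""
    T = frames.shape[0]
    z = frames[:k] @ W_enc                         # latents of past frames
    z_full = np.vstack([z, np.zeros((T - k, L))])  # zero slots mark masked future
    out = z_full @ W_dec                           # decode all T positions
    recon = np.mean((out[:k] - frames[:k]) ** 2)   # observed-frame reconstruction
    pred = np.mean((out[k:] - frames[k:]) ** 2)    # future-frame prediction
    return recon + pred

frames = rng.normal(size=(6, D))    # one clip of 6 frames
k = int(rng.integers(2, 5))         # randomly keep 2-4 past frames
loss = predictive_loss(frames, k)
```

The key design choice the sketch preserves is that the encoder never sees the future frames, so any information the decoder uses to fill the masked slots must be carried by the past latents.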
If this is right
- Generative quality continues to rise as VAE training length increases, indicating the method scales.
- Latents from the model improve performance on downstream video-understanding tasks that rely on motion understanding.
- Video diffusion models built on these latents require less training time to reach a given quality level.
Where Pith is reading between the lines
- The same masking-plus-prediction pattern could be applied during pretraining of other autoregressive or diffusion-based video models to strengthen their motion priors.
- If the predictive latents capture coherent world dynamics, they may support longer-horizon video prediction without additional fine-tuning.
- The approach suggests a general route to embed predictive world-modeling signals inside reconstruction objectives for any spatiotemporal generative task.
Load-bearing premise
That forcing the latent space to encode temporally predictive structures through simultaneous reconstruction and future prediction will produce latents whose diffusability directly improves downstream generative performance.
What would settle it
Train an otherwise identical video VAE without the future-prediction term and measure whether its generated-video FVD on UCF101 is at least 30 points worse than the predictive version; equal or better performance would falsify the central claim.
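The metric in that test, FVD, is a Fréchet distance computed between Gaussians fitted to I3D features of real and generated videos. A minimal sketch of the underlying statistic, using synthetic features in place of I3D activations:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}).
    FVD applies this to I3D features of video clips."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    c_a = np.cov(feats_a, rowvar=False)
    c_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(c_a @ c_b)
    if np.iscomplexobj(covmean):   # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(c_a + c_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
fake_good = rng.normal(size=(500, 16))           # same distribution as real
fake_bad = rng.normal(loc=1.0, size=(500, 16))   # mean-shifted distribution
assert frechet_distance(real, fake_good) < frechet_distance(real, fake_bad)
```

A 30-point FVD margin is therefore a claim about distributional distance in feature space, which is why the ablation must hold the feature extractor and evaluation protocol fixed across the two VAE variants.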
Original abstract
Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve the video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Predictive Video VAE (PV-VAE), a video VAE trained with a predictive reconstruction objective: future frames are randomly discarded so that the encoder sees only partial past observations, while the decoder is trained to reconstruct the observed frames and predict the missing future frames simultaneously. This is argued to encourage temporally predictive structures in the latent space, improving diffusability for downstream diffusion-based video generation. The central empirical claims are a 52% faster convergence and 34.42 FVD improvement over the Wan2.2 VAE baseline on UCF101, plus favorable scalability and gains on downstream video understanding tasks.
Significance. If the predictive objective can be shown to specifically enhance latent diffusability (rather than merely altering reconstruction statistics or training dynamics), the approach would offer a lightweight, principle-driven way to improve video VAEs without architectural overhaul. The reported scalability with VAE training compute and consistent downstream benefits would strengthen its practical value for latent generative modeling.
Major comments (3)
- [Abstract] The abstract reports concrete numerical gains (52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101) but supplies no information on experimental controls, including whether the baseline was re-trained with identical data, optimizer, compute budget, or hyperparameters, nor any mention of statistical significance or variance across runs. Without these, the gains cannot be confidently attributed to the predictive objective.
- [Abstract] Abstract and central claim: The manuscript asserts that unifying reconstruction with future-frame prediction from partial observations produces latents with improved diffusability that directly drive the observed generative gains. However, no intermediate diagnostics are described (e.g., diffusion training loss on the latents, noise-prediction error curves, or latent-space Fréchet distance) that would isolate diffusability improvements from confounding factors such as shifts in reconstruction-prediction trade-off or incidental changes in latent marginals.
- [Abstract] The skeptic's concern is borne out: end-to-end FVD and convergence metrics alone do not rule out alternative explanations for the improvement. An ablation that trains the identical architecture with a pure reconstruction objective (or with prediction disabled) is required to establish that the predictive component is load-bearing for the diffusability claim.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper. We address each of the major comments below and have revised the manuscript to incorporate additional details, diagnostics, and ablations as suggested. These changes strengthen the presentation of our results and the evidence for the benefits of the predictive objective.
Point-by-point responses
-
Referee: [Abstract] The abstract reports concrete numerical gains (52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101) but supplies no information on experimental controls, including whether the baseline was re-trained with identical data, optimizer, compute budget, or hyperparameters, nor any mention of statistical significance or variance across runs. Without these, the gains cannot be confidently attributed to the predictive objective.
Authors: We agree with this observation and have revised the abstract to include a brief statement on the experimental controls: the Wan2.2 VAE baseline was re-trained with the same data, optimizer, and compute budget. We have also added details on statistical significance and variance (averaged over three independent runs) in the main text and supplementary material. This should allow readers to better evaluate the reported gains. revision: yes
-
Referee: [Abstract] Abstract and central claim: The manuscript asserts that unifying reconstruction with future-frame prediction from partial observations produces latents with improved diffusability that directly drive the observed generative gains. However, no intermediate diagnostics are described (e.g., diffusion training loss on the latents, noise-prediction error curves, or latent-space Fréchet distance) that would isolate diffusability improvements from confounding factors such as shifts in reconstruction-prediction trade-off or incidental changes in latent marginals.
Authors: We acknowledge the importance of such diagnostics to isolate the effect on diffusability. In the revised manuscript, we have included new figures showing the diffusion training loss curves for PV-VAE latents versus the baseline, demonstrating faster convergence and lower error in noise prediction. Additionally, we report latent-space Fréchet distances to show improved alignment in the latent distribution. These additions help rule out alternative explanations related to reconstruction trade-offs. revision: yes
-
Referee: [Abstract] The skeptic's concern is borne out: end-to-end FVD and convergence metrics alone do not rule out alternative explanations for the improvement. An ablation that trains the identical architecture with a pure reconstruction objective (or with prediction disabled) is required to establish that the predictive component is load-bearing for the diffusability claim.
Authors: We agree that this ablation is necessary to substantiate our central claim. We have added a dedicated ablation study in the revised manuscript (Section 4.3) where we train the same architecture with the predictive component disabled, using only reconstruction loss. The results confirm that the reconstruction-only variant performs comparably to the Wan2.2 baseline without the reported gains in FVD or convergence speed. This establishes that the predictive objective is indeed load-bearing. We have also updated the abstract to reference this ablation. revision: yes
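Structurally, the ablation the rebuttal describes amounts to a single toggle in the training objective. A hypothetical sketch (the weight name `lambda_pred` and the loss values are illustrative, not from the paper):

```python
# Hypothetical ablation toggle: the same training step with the
# future-prediction term weighted by lambda_pred. Setting lambda_pred
# to 0 recovers the pure-reconstruction control the referee asks for,
# with architecture, data, and optimizer held fixed.

def training_loss(recon_err, pred_err, lambda_pred=1.0):
    """Total loss = observed-frame reconstruction error
    + lambda_pred * masked-future prediction error."""
    return recon_err + lambda_pred * pred_err

recon_err, pred_err = 0.10, 0.25            # illustrative per-batch values
full = training_loss(recon_err, pred_err, lambda_pred=1.0)     # PV-VAE
ablated = training_loss(recon_err, pred_err, lambda_pred=0.0)  # control
assert ablated == recon_err and full == recon_err + pred_err
```

Because everything else is shared, any FVD or convergence gap between the two runs is attributable to the prediction term alone, which is what makes the ablation load-bearing for the central claim.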
Circularity Check
No circularity: predictive objective defined independently of generative metrics
Full rationale
The paper defines its core training objective (randomly masking future frames, encoding partial observations, and jointly reconstructing observed frames while predicting future ones) as an independent design choice motivated by predictive world modeling. This objective is not derived from or fitted to the downstream FVD or convergence metrics; instead, the VAE is trained with the predictive loss and then evaluated separately on video generation tasks. No equations reduce the claimed diffusability improvement to a tautology, no self-citations bear the central load, and no fitted parameters are relabeled as predictions. The reported gains are empirical outcomes, not forced by construction from the inputs.