Video Generation with Predictive Latents
Pith reviewed 2026-05-09 16:52 UTC · model grok-4.3
The pith
A video VAE trained to predict future frames from partial observations produces latents that generate higher-quality videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Predictive Video VAE randomly discards future frames, encodes only the remaining past frames, and trains its decoder to reconstruct the observed frames while simultaneously predicting the missing future ones. This produces a latent space with improved temporal coherence that supports superior video generation: 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101.
What carries the argument
The predictive reconstruction objective that unifies reconstruction of observed frames with prediction of future frames from partial past inputs.
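The shape of this objective can be sketched in a few lines. The following is a toy illustration, not the paper's code: a linear encoder/decoder stands in for the video VAE, and zero-padded latent slots merely mark the masked future positions (in the real model the decoder predicts them from past context). All names (`predictive_loss`, `W_enc`, `W_dec`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear encoder/decoder standing in for the video VAE.
D, L = 8, 4                              # frame dim, latent dim
W_enc = rng.normal(size=(D, L)) * 0.1
W_dec = rng.normal(size=(L, D)) * 0.1

def predictive_loss(frames, k):
    """frames: (T, D). Encode only the first k frames, decode all T,
    and penalize reconstruction of the observed frames plus prediction
    of the masked future ones, as in the unified objective."""
    T = frames.shape[0]
    z = frames[:k] @ W_enc                         # latents of past frames
    z_full = np.vstack([z, np.zeros((T - k, L))])  # zero slots mark masked future
    out = z_full @ W_dec                           # decode all T positions
    recon = np.mean((out[:k] - frames[:k]) ** 2)   # observed-frame reconstruction
    pred = np.mean((out[k:] - frames[k:]) ** 2)    # future-frame prediction
    return recon + pred

frames = rng.normal(size=(6, D))    # one clip of 6 frames
k = int(rng.integers(2, 5))         # randomly keep 2-4 past frames
loss = predictive_loss(frames, k)
```

The key design choice the sketch preserves is that the encoder never sees the future frames, so any information the decoder uses to fill the masked slots must be carried by the past latents.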
If this is right
- Generative quality continues to rise as VAE training length increases, indicating the method scales.
- Latents from the model improve performance on downstream video-understanding tasks that rely on motion understanding.
- Video diffusion models built on these latents require less training time to reach a given quality level.
Where Pith is reading between the lines
- The same masking-plus-prediction pattern could be applied during pretraining of other autoregressive or diffusion-based video models to strengthen their motion priors.
- If the predictive latents capture coherent world dynamics, they may support longer-horizon video prediction without additional fine-tuning.
- The approach suggests a general route to embed predictive world-modeling signals inside reconstruction objectives for any spatiotemporal generative task.
Load-bearing premise
That forcing the latent space to encode temporally predictive structures through simultaneous reconstruction and future prediction will produce latents whose diffusability directly improves downstream generative performance.
What would settle it
Train an otherwise identical video VAE without the future-prediction term and measure whether its generated-video FVD on UCF101 is at least 30 points worse than the predictive version; equal or better performance would falsify the central claim.
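The metric in that test, FVD, is a Fréchet distance computed between Gaussians fitted to I3D features of real and generated videos. A minimal sketch of the underlying statistic, using synthetic features in place of I3D activations:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2}).
    FVD applies this to I3D features of video clips."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    c_a = np.cov(feats_a, rowvar=False)
    c_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(c_a @ c_b)
    if np.iscomplexobj(covmean):   # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(c_a + c_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
fake_good = rng.normal(size=(500, 16))           # same distribution as real
fake_bad = rng.normal(loc=1.0, size=(500, 16))   # mean-shifted distribution
assert frechet_distance(real, fake_good) < frechet_distance(real, fake_bad)
```

A 30-point FVD margin is therefore a claim about distributional distance in feature space, which is why the ablation must hold the feature extractor and evaluation protocol fixed across the two VAE variants.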
Original abstract
Video Variational Autoencoder (VAE) enables latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve the video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to reconstruct the observed frames and predict future ones simultaneously. This design encourages the latent space to encode temporally predictive structures and build a more coherent understanding of video dynamics, thereby improving generation quality. Our model, termed Predictive Video VAE (PV-VAE), achieves superior performance on video generation, with 52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101. Furthermore, comprehensive analyses demonstrate that PV-VAE not only exhibits favorable scalability, with generative performance improving alongside VAE training, but also yields consistent gains in downstream video understanding, underscoring a latent space that effectively captures temporal coherence and motion priors.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Predictive Video VAE (PV-VAE), a video VAE trained with a predictive reconstruction objective: future frames are randomly discarded so that the encoder sees only partial past observations, while the decoder is trained to reconstruct the observed frames and predict the missing future frames simultaneously. This is argued to encourage temporally predictive structures in the latent space, improving diffusability for downstream diffusion-based video generation. The central empirical claims are a 52% faster convergence and 34.42 FVD improvement over the Wan2.2 VAE baseline on UCF101, plus favorable scalability and gains on downstream video understanding tasks.
Significance. If the predictive objective can be shown to specifically enhance latent diffusability (rather than merely altering reconstruction statistics or training dynamics), the approach would offer a lightweight, principle-driven way to improve video VAEs without architectural overhaul. The reported scalability with VAE training compute and consistent downstream benefits would strengthen its practical value for latent generative modeling.
Major comments (3)
- [Abstract] The abstract reports concrete numerical gains (52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101) but supplies no information on experimental controls, including whether the baseline was re-trained with identical data, optimizer, compute budget, or hyperparameters, nor any mention of statistical significance or variance across runs. Without these, the gains cannot be confidently attributed to the predictive objective.
- [Abstract] Abstract and central claim: The manuscript asserts that unifying reconstruction with future-frame prediction from partial observations produces latents with improved diffusability that directly drive the observed generative gains. However, no intermediate diagnostics are described (e.g., diffusion training loss on the latents, noise-prediction error curves, or latent-space Fréchet distance) that would isolate diffusability improvements from confounding factors such as shifts in reconstruction-prediction trade-off or incidental changes in latent marginals.
- [Abstract] The skeptic's concern is borne out: end-to-end FVD and convergence metrics alone do not rule out alternative explanations for the improvement. An ablation that trains the identical architecture with a pure reconstruction objective (or with prediction disabled) is required to establish that the predictive component is load-bearing for the diffusability claim.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper. We address each of the major comments below and have revised the manuscript to incorporate additional details, diagnostics, and ablations as suggested. These changes strengthen the presentation of our results and the evidence for the benefits of the predictive objective.
Point-by-point responses
-
Referee: [Abstract] The abstract reports concrete numerical gains (52% faster convergence and a 34.42 FVD improvement over the Wan2.2 VAE on UCF101) but supplies no information on experimental controls, including whether the baseline was re-trained with identical data, optimizer, compute budget, or hyperparameters, nor any mention of statistical significance or variance across runs. Without these, the gains cannot be confidently attributed to the predictive objective.
Authors: We agree with this observation and have revised the abstract to include a brief statement on the experimental controls: the Wan2.2 VAE baseline was re-trained with the same data, optimizer, and compute budget. We have also added details on statistical significance and variance (averaged over three independent runs) in the main text and supplementary material. This should allow readers to better evaluate the reported gains. revision: yes
-
Referee: [Abstract] Abstract and central claim: The manuscript asserts that unifying reconstruction with future-frame prediction from partial observations produces latents with improved diffusability that directly drive the observed generative gains. However, no intermediate diagnostics are described (e.g., diffusion training loss on the latents, noise-prediction error curves, or latent-space Fréchet distance) that would isolate diffusability improvements from confounding factors such as shifts in reconstruction-prediction trade-off or incidental changes in latent marginals.
Authors: We acknowledge the importance of such diagnostics to isolate the effect on diffusability. In the revised manuscript, we have included new figures showing the diffusion training loss curves for PV-VAE latents versus the baseline, demonstrating faster convergence and lower error in noise prediction. Additionally, we report latent-space Fréchet distances to show improved alignment in the latent distribution. These additions help rule out alternative explanations related to reconstruction trade-offs. revision: yes
-
Referee: [Abstract] The skeptic's concern is borne out: end-to-end FVD and convergence metrics alone do not rule out alternative explanations for the improvement. An ablation that trains the identical architecture with a pure reconstruction objective (or with prediction disabled) is required to establish that the predictive component is load-bearing for the diffusability claim.
Authors: We agree that this ablation is necessary to substantiate our central claim. We have added a dedicated ablation study in the revised manuscript (Section 4.3) where we train the same architecture with the predictive component disabled, using only reconstruction loss. The results confirm that the reconstruction-only variant performs comparably to the Wan2.2 baseline without the reported gains in FVD or convergence speed. This establishes that the predictive objective is indeed load-bearing. We have also updated the abstract to reference this ablation. revision: yes
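Structurally, the ablation the rebuttal describes amounts to a single toggle in the training objective. A hypothetical sketch (the weight name `lambda_pred` and the loss values are illustrative, not from the paper):

```python
# Hypothetical ablation toggle: the same training step with the
# future-prediction term weighted by lambda_pred. Setting lambda_pred
# to 0 recovers the pure-reconstruction control the referee asks for,
# with architecture, data, and optimizer held fixed.

def training_loss(recon_err, pred_err, lambda_pred=1.0):
    """Total loss = observed-frame reconstruction error
    + lambda_pred * masked-future prediction error."""
    return recon_err + lambda_pred * pred_err

recon_err, pred_err = 0.10, 0.25            # illustrative per-batch values
full = training_loss(recon_err, pred_err, lambda_pred=1.0)     # PV-VAE
ablated = training_loss(recon_err, pred_err, lambda_pred=0.0)  # control
assert ablated == recon_err and full == recon_err + pred_err
```

Because everything else is shared, any FVD or convergence gap between the two runs is attributable to the prediction term alone, which is what makes the ablation load-bearing for the central claim.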
Circularity Check
No circularity: predictive objective defined independently of generative metrics
Full rationale
The paper defines its core training objective (randomly masking future frames, encoding partial observations, and jointly reconstructing observed frames while predicting future ones) as an independent design choice motivated by predictive world modeling. This objective is not derived from or fitted to the downstream FVD or convergence metrics; instead, the VAE is trained with the predictive loss and then evaluated separately on video generation tasks. No equations reduce the claimed diffusability improvement to a tautology, no self-citations bear the central load, and no fitted parameters are relabeled as predictions. The reported gains are empirical outcomes, not forced by construction from the inputs.