
arXiv: 2507.13942 · v2 · submitted 2025-07-18 · 💻 cs.CV · cs.AI · cs.LG

Frozen Forecasting: A Unified Evaluation

Pith reviewed 2026-05-19 03:38 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords frozen vision backbones · forecasting evaluation · latent diffusion · video pretraining · perceptual quality · representation space · future trajectories · unified evaluation

The pith

A unified test using latent diffusion in representation space reveals that video-pretrained models forecast futures better than image-pretrained ones across abstraction levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to measure the forecasting ability of frozen vision backbones by training latent diffusion models to predict future features directly in the backbone's representation space and then decoding those predictions with lightweight task-specific readouts. This approach allows consistent comparison across models and tasks ranging from pixel-level predictions to high-level object motion, without retraining the core model for each new forecasting problem. A reader would care because the results show that forecasting performance tracks closely with perceptual quality and that video-pretrained models hold an advantage over image-pretrained ones at every level of abstraction.

Core claim

By training latent diffusion models to forecast entire future trajectories in the representation space of a frozen vision backbone and decoding them via lightweight readouts, the intrinsic forecasting capacity of the backbone can be isolated and evaluated uniformly across diverse tasks, revealing a strong correlation with perceptual quality and consistent superiority of video-pretrained models over image-pretrained ones.

What carries the argument

Latent diffusion models trained to forecast future features in the frozen backbone's representation space, decoded by lightweight task-specific readouts.

If this is right

  • Forecasting performance strongly correlates with perceptual quality across models.
  • Video-pretrained models consistently outperform image-based models at all levels of abstraction.
  • Language supervision does not consistently improve forecasting ability.
  • Video synthesis models match or exceed the forecasting performance of masking-based pretraining regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be used to screen new pretraining recipes for their effect on long-horizon prediction without full task retraining.
  • The observed correlation implies that models strong at static image synthesis may already encode useful temporal structure even when trained only on images.
  • If the isolation holds, the method offers a way to compare predictive power in multimodal models that include language or other modalities.

Load-bearing premise

That the diffusion training in representation space and the choice of lightweight readouts accurately isolate the backbone's own forecasting capacity without being dominated by the diffusion process itself or readout design.

What would settle it

If swapping the diffusion architecture or readout heads changes the performance ranking of the nine tested backbones, or if forecasting scores show no correlation with independent perceptual quality measures on the same models.
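
One way to make the ranking test concrete is a Spearman rank correlation between the backbone scores obtained under two evaluation setups (for example, two forecaster architectures). The sketch below is an illustration only and is not taken from the paper; the function names and the example scores are hypothetical placeholders, and the tie handling (average ranks) is an assumption.

```python
def ranks(scores):
    """Return 1-based ranks; tied scores share the average of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for four backbones under two forecasters.
scores_setup_a = [0.61, 0.48, 0.70, 0.55]
scores_setup_b = [0.58, 0.50, 0.66, 0.52]
print(spearman(scores_setup_a, scores_setup_b))
```

A value near 1 would indicate that the ranking survives the swap; a low or negative value would suggest that the forecaster or readout design, rather than the backbone, drives the ordering.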

Figures

Figures reproduced from arXiv: 2507.13942 by Carl Doersch, Guangyao Zhou, Jacob C Walker, João Carreira, Luisa Polania Cabrera, Maks Ovsjanikov, Pedro Vélez, Rishabh Kabra, Sayna Ebrahimi, Shiry Ginosar.

Figure 1: Forecasting performance strongly correlates with perceptual ability over short time horizons. While (a) perception is understanding current percepts, (b) forecasting predicts future states of the world. (c) We evaluate forecasting on pixels, point tracks, bounding box tracks, and depth using 10 samples per example and report the normalized max (min for lower-is-better metrics) performance per task. We comp…

Figure 2: Diffusion-based forecasting method from frozen vision model backbones. (a) Perception-style readouts: we train readout heads on frozen representations to perform downstream perception tasks like object detection on observed frames as in [4]. We extend this setup to forecasting as follows. (b) Forecasting framework: We introduce a forecasting diffusion model that predicts future representations conditioned …

Figure 3: Forecasting per-example metric results. We evaluate forecasting on pixels (PSNR), point tracks (Jaccard Distance), bounding box tracks (IoU), and depth maps (Mean Absolute Relative Error) using 10 samples per example. For reference, we also report perception performance on each task. Given the stochastic nature of forecasting, we report the mean and maximum/minimum performance across samples. This reveals …

Figure 4: Qualitative forecasts from the 4DS-e model across diverse tasks. We condition on frames 1–4 and forecast frames 5–16. Top: Pixels forecasting: the model captures smooth camera motion. Middle: Bounding boxes: it predicts a car turning (left) and vehicle motion (right). Bottom: Point tracks: the model forecasts a hand rising (left) and camera motion (right). These results demonstrate that our approach generaliz…
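
The captions for Figures 1 and 3 describe drawing 10 forecast samples per example and reporting the mean together with the best sample (maximum for higher-is-better metrics such as PSNR or IoU, minimum for lower-is-better ones such as Mean Absolute Relative Error). A minimal sketch of that aggregation follows. The function names are hypothetical, the averaging over examples is an assumption, and the normalization mentioned in Figure 1 is not reproduced because the captions do not specify it.

```python
def aggregate_samples(sample_scores, higher_is_better=True):
    """Return (mean, best) over the stochastic forecast samples of one example."""
    if not sample_scores:
        raise ValueError("need at least one sample score")
    mean = sum(sample_scores) / len(sample_scores)
    # "Best" is the max for metrics like PSNR/IoU, the min for error metrics.
    best = max(sample_scores) if higher_is_better else min(sample_scores)
    return mean, best


def dataset_summary(per_example_scores, higher_is_better=True):
    """Average the per-example mean and best scores over a dataset.

    Assumption: examples are weighted equally.
    """
    pairs = [aggregate_samples(s, higher_is_better) for s in per_example_scores]
    n = len(pairs)
    mean_of_means = sum(m for m, _ in pairs) / n
    mean_of_bests = sum(b for _, b in pairs) / n
    return mean_of_means, mean_of_bests


# Illustrative placeholder values: two examples, two samples each.
print(dataset_summary([[1.0, 3.0], [2.0, 4.0]], higher_is_better=True))
```

Reporting both numbers separates typical sample quality (the mean) from whether any sampled future lands close to the ground truth (the best), which matters when several futures are plausible.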
Original abstract

Forecasting future events is a fundamental capability for general-purpose systems that plan or act across different levels of abstraction. Yet, evaluating whether a forecast is "correct" remains challenging due to the inherent uncertainty of the future. We propose a unified evaluation framework for assessing the forecasting capabilities of frozen vision backbones across diverse tasks and abstraction levels. Rather than focusing on single time steps, our framework evaluates entire trajectories and incorporates distributional metrics that better capture the multimodal nature of future outcomes. Given a frozen vision model, we train latent diffusion models to forecast future features directly in its representation space, which are then decoded via lightweight, task-specific readouts. This enables consistent evaluation across a suite of diverse tasks while isolating the forecasting capacity of the backbone itself. We apply our framework to nine diverse vision models, spanning image and video pretraining, contrastive and generative objectives, and with or without language supervision, and evaluate them on four forecasting tasks, from low-level pixel predictions to high-level object motion. We find that forecasting performance strongly correlates with perceptual quality and that the forecasting abilities of video synthesis models are comparable or exceed those pretrained in masking regimes across all levels of abstraction. However, language supervision does not consistently improve forecasting. Notably, video-pretrained models consistently outperform image-based ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified evaluation framework called Frozen Forecasting for assessing the forecasting capabilities of frozen vision backbones across abstraction levels. Given a frozen model, latent diffusion models are trained to predict future features directly in its representation space; these forecasts are then decoded by lightweight task-specific readouts. The framework is applied to nine vision models (spanning image/video pretraining, contrastive/generative objectives, and language supervision) on four forecasting tasks ranging from low-level pixel prediction to high-level object motion. Main empirical claims are that forecasting performance correlates strongly with perceptual quality, video-pretrained models consistently outperform image-based ones, and language supervision does not reliably help.

Significance. If the method successfully isolates intrinsic forecasting capacity, the framework would offer a standardized, multi-task, multi-abstraction benchmark for comparing vision backbones, with direct implications for selecting models in predictive or planning systems. The reported correlation between forecasting and perceptual quality, together with the video-vs-image advantage, would be useful empirical guidance. The work is strengthened by its attempt at a consistent protocol across diverse models and tasks, but its significance is limited by the absence of controls needed to substantiate the isolation claim.

major comments (2)
  1. [Method] Method section: The central claim that the latent-diffusion-plus-readout procedure isolates the backbone's intrinsic forecasting capacity is not supported by the required controls. No ablations are described that hold the diffusion architecture, noise schedule, number of steps, and readout capacity fixed while varying only the backbone. Without these, differences in reported forecasting scores could arise from how well each representation space aligns with the diffusion prior or from optimization ease rather than from genuine predictive structure in the frozen features.
  2. [Experiments and Results] Experiments and Results: The comparisons across the nine models and the claimed correlation with perceptual quality are presented without error bars, statistical significance tests, or details on data splits and diffusion-model capacity controls. This makes it impossible to assess whether the consistent outperformance of video-pretrained models is robust or whether the ranking could change under a non-diffusion forecaster (e.g., linear autoregressive head).
minor comments (2)
  1. [Abstract / Method] The abstract and method description would benefit from an explicit table listing the four forecasting tasks, their abstraction levels, evaluation metrics, and datasets.
  2. [Method] Notation for the representation space and the diffusion objective could be introduced earlier and used consistently to improve readability.
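
The error bars requested in major comment 2 could be obtained in several ways; the paper does not say which, so the following is only one possibility. It sketches a percentile bootstrap over per-example scores using the Python standard library. The function name, the resample count, and the 95% level are assumptions chosen for illustration.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the examples with replacement and record the mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Illustrative placeholder data: per-example score differences between two
# backbones evaluated on the same examples (not results from the paper).
paired_differences = [0.04, 0.01, 0.06, -0.02, 0.03, 0.05, 0.00, 0.02]
low, high = bootstrap_ci(paired_differences)
print(low, high)
```

Applied to paired per-example differences between two backbones, an interval that excludes zero would support the claim that one consistently outperforms the other on that task.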

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.

  1. Referee: [Method] The central claim that the latent-diffusion-plus-readout procedure isolates the backbone's intrinsic forecasting capacity is not supported by the required controls. No ablations are described that hold the diffusion architecture, noise schedule, number of steps, and readout capacity fixed while varying only the backbone. Without these, differences in reported forecasting scores could arise from how well each representation space aligns with the diffusion prior or from optimization ease rather than from genuine predictive structure in the frozen features.

    Authors: We appreciate the referee's emphasis on rigorously isolating the backbone's forecasting capacity. Our framework employs a fixed latent diffusion model architecture, noise schedule, and number of diffusion steps across all backbones, with task-specific readouts kept lightweight and consistent in capacity. This design aims to attribute performance differences primarily to the quality of the frozen representations. Nevertheless, we acknowledge that additional ablations explicitly varying only the backbone while holding all other elements constant would provide stronger evidence. In the revised version, we will include such ablations on representative tasks to further substantiate the isolation claim. revision: yes

  2. Referee: [Experiments and Results] The comparisons across the nine models and the claimed correlation with perceptual quality are presented without error bars, statistical significance tests, or details on data splits and diffusion-model capacity controls. This makes it impossible to assess whether the consistent outperformance of video-pretrained models is robust or whether the ranking could change under a non-diffusion forecaster (e.g., linear autoregressive head).

    Authors: We agree that including error bars, statistical significance tests, and explicit details on data splits and diffusion model capacities would enhance the reliability of our empirical findings. Regarding the potential change in ranking under alternative forecasters, our use of latent diffusion is intended to capture the multimodal nature of future predictions, which simpler models like linear autoregressive heads may not fully address. To address the referee's concern, we will add error bars and significance tests to the results in the revision. Additionally, we will include a comparison using a linear autoregressive readout on a subset of tasks to verify the robustness of the observed rankings. revision: yes

Circularity Check

0 steps flagged

No circularity: evaluation framework derives performance metrics independently from new diffusion training on frozen representations


The paper defines a new evaluation protocol that freezes a vision backbone, trains a separate latent diffusion model to predict future features in that backbone's representation space, and then applies lightweight task-specific readouts to measure forecasting quality via distributional trajectory metrics. No equations or steps reduce the reported forecasting scores or correlations to quantities that are fitted or defined by the backbone itself; the diffusion training and readout performance are external to the backbone parameters. No self-citation is invoked as a uniqueness theorem or load-bearing justification for the central claims. The comparisons between video- and image-pretrained models, and the correlation with perceptual quality, follow directly from applying the same protocol across models rather than from any renaming, ansatz smuggling, or self-referential definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that feature-space diffusion forecasting plus lightweight decoders faithfully reflect the backbone's forecasting ability. No free parameters are explicitly fitted in the abstract description. No new entities are postulated.

axioms (1)
  • domain assumption: Forecasting future features in the representation space of a frozen vision backbone, decoded by task-specific readouts, isolates the backbone's forecasting capacity.
    This premise is invoked when the framework is introduced to enable consistent evaluation across tasks while isolating the backbone.

pith-pipeline@v0.9.0 · 5796 in / 1330 out tokens · 35146 ms · 2026-05-19T03:38:55.043581+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

    cs.CV · 2026-04 · conditional · novelty 7.0

    Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

  2. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 2 Pith papers · 7 internal anchors

  1. [1]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450

  2. [2]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  3. [3]

    Stylegan knows normal, depth, albedo, and more

    Anand Bhattad, Daniel McKee, Derek Hoiem, and David Forsyth. Stylegan knows normal, depth, albedo, and more. Advances in Neural Information Processing Systems, 36:73082–73103, 2023

  4. [4]

    Scaling 4d representations

    João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, et al. Scaling 4d representations. arXiv preprint arXiv:2412.15212, 2024

  5. [5]

    Adversarial video generation on complex datasets

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

  6. [6]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017

  7. [7]

    Stochastic video generation with a learned prior

    Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International conference on machine learning, pages 1174–1183. PMLR, 2018

  8. [8]

    Tap-vid: A benchmark for tracking any point in a video

    Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022

  9. [9]

    The fréchet distance between multivariate normal distributions

    DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455, 1982

  10. [10]

    Sur la distance de deux lois de probabilité

    Maurice Fréchet. Sur la distance de deux lois de probabilité. In Annales de l’ISUP, volume 6, pages 183–198, 1957

  11. [11]

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Meh...

  12. [12]

    Seer: Language Instructed Video Prediction with Latent Diffusion Models

    Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, and Yang Gao. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023

  13. [13]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In ECCV, 2024

  14. [14]

    Unsupervised semantic correspondence using stable diffusion

    Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised semantic correspondence using stable diffusion. Advances in Neural Information Processing Systems, 36:8266–8279, 2023

  15. [15]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  16. [16]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  17. [17]

    Diffusion models for video prediction and infilling

    Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022

  18. [18]

    Dino-foresight: Looking into the future with dino

    Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Dino-foresight: Looking into the future with dino, 2024. URL https://arxiv.org/abs/2412.11673

  19. [19]

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023

  20. [20]

    Your diffusion model is secretly a zero-shot classifier

    Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217, 2023

  21. [21]

    Imageflownet: Forecasting multiscale image-level trajectories of disease progression with irregularly-sampled longitudinal medical images

    Chen Liu, Ke Xu, Liangbo L Shen, Guillaume Huguet, Zilong Wang, Alexander Tong, Danilo Bzdok, Jay Stewart, Jay C Wang, Lucian V Del Priore, et al. Imageflownet: Forecasting multiscale image-level trajectories of disease progression with irregularly-sampled longitudinal medical images. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech ...

  22. [22]

    Predicting deeper into the future of semantic segmentation

    Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 648–657, 2017

  23. [23]

    Predicting future instance segmentation by forecasting convolutional features

    Pauline Luc, Camille Couprie, Yann Lecun, and Jakob Verbeek. Predicting future instance segmentation by forecasting convolutional features. In Proceedings of the european conference on computer vision (ECCV), pages 584–599, 2018

  24. [24]

    Diffusion hyperfeatures: Searching through time and space for semantic correspondence

    Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Advances in Neural Information Processing Systems, 36:47500–47510, 2023

  25. [25]

    Learning to listen: Modeling non-deterministic dyadic facial motion

    Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20395–20405, 2022

  26. [26]

    Sora

    OpenAI. Sora, 12 2024. URL https://openai.com/sora/

  27. [27]

    A review on deep learning techniques for video prediction

    Sergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Antonis Argyros. A review on deep learning techniques for video prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2806–2826,

  28. [28]

    doi: 10.1109/TPAMI.2020.3045007

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  30. [30]

    Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sing...

  31. [31]

    URL https://arxiv.org/abs/2410.13720

  32. [32]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen,...

  33. [33]

    An empirical study of autoregressive pre-training from videos

    Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, and Jitendra Malik. An empirical study of autoregressive pre-training from videos. arXiv preprint arXiv:2501.05453, 2025

  34. [34]

    Video (language) modeling: a baseline for generative models of natural videos

    MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014

  35. [35]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  36. [36]

    Poly-autoregressive prediction for modeling interactions

    Neerja Thakkar, Tara Sadjadpour, Jathushan Rajasegaran, Shiry Ginosar, and Jitendra Malik. Poly-autoregressive prediction for modeling interactions. In CVPR, 2025

  37. [37]

    VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS, 2022

  38. [38]

    Mocogan: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  39. [39]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. ICLR workshop, 2019

  40. [40]

    Anticipating visual representations from unlabeled video

    Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 98–106, 2016

  41. [41]

    From image to video: An empirical study of diffusion representations

    Pedro Vélez, Luisa F. Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, and Mehdi S. M. Sajjadi. From image to video: An empirical study of diffusion representations. arXiv preprint arXiv:2502.07001, 2025

  42. [42]

    VideoMAE v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE v2: Scaling video masked autoencoders with dual masking. In CVPR, pages 14549–14560, 2023

  43. [43]

    Imaginator: Conditional spatio-temporal gan for video generation

    Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. Imaginator: Conditional spatio-temporal gan for video generation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1160–1169, 2020

  44. [44]

    Aid: Adapting image2video diffusion models for instruction-guided video prediction

    Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, and Yu-Gang Jiang. Aid: Adapting image2video diffusion models for instruction-guided video prediction. arXiv preprint arXiv:2406.06465, 2024

  45. [45]

    What matters when repurposing diffusion models for general dense perception tasks?

    Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. What matters when repurposing diffusion models for general dense perception tasks? ICLR, 2025

  46. [46]

    Video diffusion models with local-global context guidance

    Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang, and You He. Video diffusion models with local-global context guidance. arXiv preprint arXiv:2306.02562, 2023

  47. [47]

    Stdiff: Spatio-temporal diffusion for continuous stochastic video prediction

    Xi Ye and Guillaume-Alexandre Bilodeau. Stdiff: Spatio-temporal diffusion for continuous stochastic video prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6666–6674, 2024

  48. [48]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023

  49. [49]

    A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36:45533–45547, 2023

  50. [50]

    Trajectory flow matching with applications to clinical time series modelling

    Xi Nicole Zhang, Yuan Pu, Yuki Kawamura, Andrew Loza, Yoshua Bengio, Dennis Shung, and Alexander Tong. Trajectory flow matching with applications to clinical time series modelling. Advances in Neural Information Processing Systems, 37:107198–107224, 2024

  51. [51]

    Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. VideoPrism: A foundational visual encoder for video understanding. In ICML, 2024

  52. [52]

    Unleashing text-to-image diffusion models for visual perception

    Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5729–5739, 2023.

A Appendix

A.1 Ablations

Table 3: Diffusion vs Regression. Extracted header: Model, Pixels, Depth, Point Tracks, Box Tracks, Mean ↑ Best ↑...