Frozen Forecasting: A Unified Evaluation
Pith reviewed 2026-05-19 03:38 UTC · model grok-4.3
The pith
A unified test using latent diffusion in representation space reveals that video-pretrained models forecast futures better than image-pretrained ones across abstraction levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors train latent diffusion models to forecast entire future trajectories in the representation space of a frozen vision backbone, then decode those forecasts with lightweight readouts. This is meant to isolate the backbone's intrinsic forecasting capacity and evaluate it uniformly across diverse tasks. The results show a strong correlation between forecasting performance and perceptual quality, and consistent superiority of video-pretrained models over image-pretrained ones.
What carries the argument
Latent diffusion models trained to forecast future features in the frozen backbone's representation space, decoded by lightweight task-specific readouts.
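To make the mechanism concrete, here is a minimal sketch of the noising step and training loss that a diffusion forecaster over frozen features could use. It assumes the standard DDPM formulation with a linear beta schedule; the page does not state the paper's actual schedule, step count, or denoiser, so the function names, the schedule values, and the toy feature trajectory are all hypothetical.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed, not from the paper)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_trajectory(future_feats, alpha_bar, rng):
    """Forward process: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, with eps ~ N(0, I)."""
    noisy, eps = [], []
    for frame in future_feats:
        e = [rng.gauss(0.0, 1.0) for _ in frame]
        noisy.append([math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * n
                      for z, n in zip(frame, e)])
        eps.append(e)
    return noisy, eps

def denoising_loss(pred_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    diffs = [(p - q) ** 2 for pf, tf in zip(pred_eps, true_eps) for p, q in zip(pf, tf)]
    return sum(diffs) / len(diffs)

rng = random.Random(0)
# Hypothetical frozen-backbone features: 4 future frames, 3 dimensions each.
future_feats = [[0.1 * (t + d) for d in range(3)] for t in range(4)]
alpha_bars = linear_alpha_bars(num_steps=1000)
noisy, eps = noise_trajectory(future_feats, alpha_bars[500], rng)
# A trained denoiser would see the noisy future, the step index, and past-frame
# features; a zero predictor stands in for it here.
zero_pred = [[0.0] * 3 for _ in range(4)]
loss = denoising_loss(zero_pred, eps)
```

The backbone never receives gradients in this setup: only the denoiser and, later, the readout are trained, which is what lets score differences be attributed to the representation.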
If this is right
- Forecasting performance strongly correlates with perceptual quality across models.
- Video-pretrained models consistently outperform image-based models at all levels of abstraction.
- Language supervision does not consistently improve forecasting ability.
- Video synthesis models match or exceed the forecasting performance of masking-based pretraining regimes.
Where Pith is reading between the lines
- The framework could be used to screen new pretraining recipes for their effect on long-horizon prediction without full task retraining.
- The observed correlation implies that models strong at static image synthesis may already encode useful temporal structure even when trained only on images.
- If the isolation holds, the method offers a way to compare predictive power in multimodal models that include language or other modalities.
Load-bearing premise
The diffusion training in representation space and the choice of lightweight readouts must accurately isolate the backbone's own forecasting capacity, without the result being dominated by the diffusion process itself or by readout design.
What would settle it
The claim would be undermined if swapping the diffusion architecture or readout heads changed the performance ranking of the nine tested backbones, or if forecasting scores showed no correlation with independent perceptual quality measures on the same models.
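One way to check whether a ranking survives such a swap is a rank correlation between the scores obtained under the two setups. The sketch below computes Spearman's rho in plain Python; the backbone scores are made-up placeholders, not results from the paper.

```python
def ranks(scores):
    """Rank 1 = highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(a, b):
    """Pearson correlation of the two rank vectors (undefined if either is constant)."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical scores for five backbones under two different forecasters.
diffusion_scores = [0.71, 0.64, 0.58, 0.49, 0.45]
alternative_scores = [0.60, 0.62, 0.50, 0.41, 0.43]
rho = spearman(diffusion_scores, alternative_scores)
```

A rho near 1 would indicate that the ordering of backbones is largely insensitive to the forecaster; a low or negative value would point to the forecaster, not the backbone, driving the ranking.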
Original abstract
Forecasting future events is a fundamental capability for general-purpose systems that plan or act across different levels of abstraction. Yet, evaluating whether a forecast is "correct" remains challenging due to the inherent uncertainty of the future. We propose a unified evaluation framework for assessing the forecasting capabilities of frozen vision backbones across diverse tasks and abstraction levels. Rather than focusing on single time steps, our framework evaluates entire trajectories and incorporates distributional metrics that better capture the multimodal nature of future outcomes. Given a frozen vision model, we train latent diffusion models to forecast future features directly in its representation space, which are then decoded via lightweight, task-specific readouts. This enables consistent evaluation across a suite of diverse tasks while isolating the forecasting capacity of the backbone itself. We apply our framework to nine diverse vision models, spanning image and video pretraining, contrastive and generative objectives, and with or without language supervision, and evaluate them on four forecasting tasks, from low-level pixel predictions to high-level object motion. We find that forecasting performance strongly correlates with perceptual quality and that the forecasting abilities of video synthesis models are comparable to or exceed those of models pretrained in masking regimes across all levels of abstraction. However, language supervision does not consistently improve forecasting. Notably, video-pretrained models consistently outperform image-based ones.
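The abstract's point about multimodal futures can be illustrated with a toy example. The sketch below scores a set of sampled trajectories by their mean error and their best-of-N error, a common pair of sample-based measures; it is an assumption for illustration and not necessarily the metric set used in the paper. All trajectories are invented.

```python
def trajectory_error(pred, target):
    """Mean squared error over all time steps and feature dimensions."""
    total, count = 0.0, 0
    for p_step, t_step in zip(pred, target):
        for p, t in zip(p_step, t_step):
            total += (p - t) ** 2
            count += 1
    return total / count

def mean_and_best_of_n(samples, target):
    """Average error over samples, and the error of the closest sample."""
    errors = [trajectory_error(s, target) for s in samples]
    return sum(errors) / len(errors), min(errors)

# Toy bimodal future: an object moves either left or right over two steps.
went_right = [[1.0], [2.0]]          # what actually happened
sample_left = [[-1.0], [-2.0]]       # one sampled forecast
sample_right = [[1.0], [2.0]]        # another sampled forecast
mean_err, best_err = mean_and_best_of_n([sample_left, sample_right], went_right)

# A deterministic regressor that averages both modes predicts "stay put".
averaged_forecast = [[0.0], [0.0]]
regression_err = trajectory_error(averaged_forecast, went_right)
```

Here the averaged forecast has a lower error (2.5) than the sample mean (5.0), yet it describes a future that never occurs, while the sampler covers the true outcome exactly (best-of-N error 0.0). Single-step, single-prediction metrics would reward the wrong behavior in this case.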
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a unified evaluation framework called Frozen Forecasting for assessing the forecasting capabilities of frozen vision backbones across abstraction levels. Given a frozen model, latent diffusion models are trained to predict future features directly in its representation space; these forecasts are then decoded by lightweight task-specific readouts. The framework is applied to nine vision models (spanning image/video pretraining, contrastive/generative objectives, and language supervision) on four forecasting tasks ranging from low-level pixel prediction to high-level object motion. Main empirical claims are that forecasting performance correlates strongly with perceptual quality, video-pretrained models consistently outperform image-based ones, and language supervision does not reliably help.
Significance. If the method successfully isolates intrinsic forecasting capacity, the framework would offer a standardized, multi-task, multi-abstraction benchmark for comparing vision backbones, with direct implications for selecting models in predictive or planning systems. The reported correlation between forecasting and perceptual quality, together with the video-vs-image advantage, would be useful empirical guidance. The work is strengthened by its attempt at a consistent protocol across diverse models and tasks, but its significance is limited by the absence of controls needed to substantiate the isolation claim.
major comments (2)
- [Method] The central claim that the latent-diffusion-plus-readout procedure isolates the backbone's intrinsic forecasting capacity is not supported by the required controls. No ablations are described that hold the diffusion architecture, noise schedule, number of steps, and readout capacity fixed while varying only the backbone. Without these, differences in reported forecasting scores could arise from how well each representation space aligns with the diffusion prior or from optimization ease rather than from genuine predictive structure in the frozen features.
- [Experiments and Results] The comparisons across the nine models and the claimed correlation with perceptual quality are presented without error bars, statistical significance tests, or details on data splits and diffusion-model capacity controls. This makes it impossible to assess whether the consistent outperformance of video-pretrained models is robust or whether the ranking could change under a non-diffusion forecaster (e.g., linear autoregressive head).
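The error bars requested above could be produced in several ways. As one hedged illustration, the sketch below computes a paired percentile-bootstrap confidence interval for the mean score difference between two models evaluated on the same clips. The per-clip scores are invented placeholders, and the function name and defaults are hypothetical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b), resampling clips with replacement."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(num_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int(round((alpha / 2) * num_resamples))]
    hi = means[int(round((1 - alpha / 2) * num_resamples)) - 1]
    return sum(diffs) / n, lo, hi

# Hypothetical per-clip scores (higher is better) for two backbones on the same six clips.
video_model = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
image_model = [0.55, 0.54, 0.63, 0.61, 0.57, 0.60]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(video_model, image_model)
```

If the interval excludes zero, the gap between the two models is unlikely to be an artifact of which clips were sampled. Pairing by clip matters because clip difficulty usually varies more than the gap between models.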
minor comments (2)
- [Abstract / Method] The abstract and method description would benefit from an explicit table listing the four forecasting tasks, their abstraction levels, evaluation metrics, and datasets.
- [Method] Notation for the representation space and the diffusion objective could be introduced earlier and used consistently to improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.
- Referee: [Method] The central claim that the latent-diffusion-plus-readout procedure isolates the backbone's intrinsic forecasting capacity is not supported by the required controls. No ablations are described that hold the diffusion architecture, noise schedule, number of steps, and readout capacity fixed while varying only the backbone. Without these, differences in reported forecasting scores could arise from how well each representation space aligns with the diffusion prior or from optimization ease rather than from genuine predictive structure in the frozen features.
Authors: We appreciate the referee's emphasis on rigorously isolating the backbone's forecasting capacity. Our framework employs a fixed latent diffusion model architecture, noise schedule, and number of diffusion steps across all backbones, with task-specific readouts kept lightweight and consistent in capacity. This design aims to attribute performance differences primarily to the quality of the frozen representations. Nevertheless, we acknowledge that additional ablations explicitly varying only the backbone while holding all other elements constant would provide stronger evidence. In the revised version, we will include such ablations on representative tasks to further substantiate the isolation claim.
Revision: yes
- Referee: [Experiments and Results] The comparisons across the nine models and the claimed correlation with perceptual quality are presented without error bars, statistical significance tests, or details on data splits and diffusion-model capacity controls. This makes it impossible to assess whether the consistent outperformance of video-pretrained models is robust or whether the ranking could change under a non-diffusion forecaster (e.g., linear autoregressive head).
Authors: We agree that including error bars, statistical significance tests, and explicit details on data splits and diffusion model capacities would enhance the reliability of our empirical findings. Regarding the potential change in ranking under alternative forecasters, our use of latent diffusion is intended to capture the multimodal nature of future predictions, which simpler models like linear autoregressive heads may not fully address. To address the referee's concern, we will add error bars and significance tests to the results in the revision. Additionally, we will include a comparison using a linear autoregressive readout on a subset of tasks to verify the robustness of the observed rankings.
Revision: yes
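For reference, a linear autoregressive forecaster of the kind mentioned in this exchange can be very small. The sketch below fits one coefficient per feature dimension by least squares and rolls the model forward. This diagonal form is a deliberately simplified assumption (a full linear head would fit a dense matrix), and the toy trajectory is invented.

```python
def fit_diagonal_ar(trajectories):
    """Least-squares fit of z[t+1][d] = a[d] * z[t][d], one coefficient per dimension."""
    dim = len(trajectories[0][0])
    num = [0.0] * dim
    den = [0.0] * dim
    for traj in trajectories:
        for cur, nxt in zip(traj[:-1], traj[1:]):
            for d in range(dim):
                num[d] += cur[d] * nxt[d]
                den[d] += cur[d] * cur[d]
    # Fall back to 0.0 for dimensions that are identically zero in the data.
    return [n / d if d > 0 else 0.0 for n, d in zip(num, den)]

def rollout(last_frame, coeffs, horizon):
    """Apply the fitted coefficients repeatedly to forecast `horizon` future frames."""
    frames, cur = [], list(last_frame)
    for _ in range(horizon):
        cur = [a * z for a, z in zip(coeffs, cur)]
        frames.append(cur)
    return frames

# Toy feature trajectory: dimension 0 halves each step, dimension 1 doubles.
train = [[[1.0, 2.0], [0.5, 4.0], [0.25, 8.0]]]
coeffs = fit_diagonal_ar(train)
forecast = rollout(train[0][-1], coeffs, horizon=2)
```

Such a head yields exactly one trajectory per input, so it cannot represent several plausible futures at once. That is the limitation the authors cite, and also why it is a useful control: if backbone rankings agree under both forecasters, the ranking is less likely to be an artifact of the diffusion model.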
Circularity Check
No circularity: evaluation framework derives performance metrics independently from new diffusion training on frozen representations
The paper defines a new evaluation protocol that freezes a vision backbone, trains a separate latent diffusion model to predict future features in that backbone's representation space, and then applies lightweight task-specific readouts to measure forecasting quality via distributional trajectory metrics. No equations or steps reduce the reported forecasting scores or correlations to quantities that are fitted or defined by the backbone itself; the diffusion training and readout performance are external to the backbone parameters. No self-citation is invoked as a uniqueness theorem or load-bearing justification for the central claims. The comparisons between video- and image-pretrained models, and the correlation with perceptual quality, follow directly from applying the same protocol across models rather than from any renaming, ansatz smuggling, or self-referential definition. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Forecasting future features in the representation space of a frozen vision backbone, decoded by task-specific readouts, isolates the backbone's forecasting capacity.
Forward citations
Cited by 2 Pith papers
- A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
  Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.
- Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
  Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.