Frozen Forecasting: A Unified Evaluation

Carl Doersch; Guangyao Zhou; Jacob C Walker; João Carreira; Luisa Polania Cabrera; Maks Ovsjanikov; Pedro Vélez; Rishabh Kabra; Sayna Ebrahimi; Shiry Ginosar

arxiv: 2507.13942 · v2 · submitted 2025-07-18 · 💻 cs.CV · cs.AI · cs.LG

Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Sayna Ebrahimi, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, Shiry Ginosar

Pith reviewed 2026-05-19 03:38 UTC · model grok-4.3

keywords frozen vision backbones · forecasting evaluation · latent diffusion · video pretraining · perceptual quality · representation space · future trajectories · unified evaluation

The pith

A unified test using latent diffusion in representation space reveals that video-pretrained models forecast futures better than image-pretrained ones across abstraction levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework to measure the forecasting ability of frozen vision backbones by training latent diffusion models to predict future features directly in the backbone's representation space and then decoding those predictions with lightweight task-specific readouts. This approach allows consistent comparison across models and tasks ranging from pixel-level predictions to high-level object motion, without retraining the core model for each new forecasting problem. A reader would care because the results show that forecasting performance tracks closely with perceptual quality and that video-pretrained models hold an advantage over image-pretrained ones at every level of abstraction.

Core claim

By training latent diffusion models to forecast entire future trajectories in the representation space of a frozen vision backbone and decoding them via lightweight readouts, the intrinsic forecasting capacity of the backbone can be isolated and evaluated uniformly across diverse tasks, revealing a strong correlation with perceptual quality and consistent superiority of video-pretrained models over image-pretrained ones.

What carries the argument

Latent diffusion models trained to forecast future features in the frozen backbone's representation space, decoded by lightweight task-specific readouts.
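The mechanism described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the linear `frozen_backbone`, the cosine noise schedule, the linear `denoiser`, and the four-number box `readout` are all hypothetical stand-ins. It shows only where each component sits: features come from a fixed encoder, the diffusion loss is computed on future features given context features, and a small readout maps features to task outputs.

```python
import numpy as np

def frozen_backbone(frames, weights):
    """Stand-in for a frozen encoder: a fixed map per frame, never updated."""
    return np.tanh(frames @ weights)  # (T, D_in) -> (T, D_feat)

def cosine_alpha_bar(t, num_steps):
    """Cumulative signal level at step t (cosine schedule, an assumption)."""
    return np.cos(0.5 * np.pi * t / num_steps) ** 2

def noised_future(future_feats, t, num_steps, rng):
    """Forward process: blend clean future features with Gaussian noise."""
    a = cosine_alpha_bar(t, num_steps)
    eps = rng.standard_normal(future_feats.shape)
    return np.sqrt(a) * future_feats + np.sqrt(1.0 - a) * eps, eps

def denoiser(noisy_future, context_feats, t, num_steps, params):
    """Toy linear denoiser: predicts noise from noisy features, mean context, step."""
    ctx = np.broadcast_to(context_feats.mean(axis=0), noisy_future.shape)
    step = np.full((noisy_future.shape[0], 1), t / num_steps)
    inputs = np.concatenate([noisy_future, ctx, step], axis=1)
    return inputs @ params

def diffusion_loss(context_feats, future_feats, params, num_steps, rng):
    """Noise-prediction loss at one random step, computed in feature space."""
    t = int(rng.integers(1, num_steps))
    noisy, eps = noised_future(future_feats, t, num_steps, rng)
    eps_hat = denoiser(noisy, context_feats, t, num_steps, params)
    return float(np.mean((eps_hat - eps) ** 2))

def readout(features, readout_weights):
    """Lightweight task head: here, one 4-number box per forecast frame."""
    return features @ readout_weights

rng = np.random.default_rng(0)
d_in, d_feat, num_steps = 32, 8, 100
backbone_w = rng.standard_normal((d_in, d_feat))   # frozen, never trained
video = rng.standard_normal((16, d_in))            # 16 toy "frames"
feats = frozen_backbone(video, backbone_w)
context, future = feats[:4], feats[4:]             # condition on 4, forecast 12
params = 0.01 * rng.standard_normal((2 * d_feat + 1, d_feat))
loss = diffusion_loss(context, future, params, num_steps, rng)
boxes = readout(future, rng.standard_normal((d_feat, 4)))
print(loss, boxes.shape)
```

Only `params` and the readout weights would receive gradients in a real training loop; `backbone_w` stays fixed, which is what lets the same protocol be reused across backbones.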

If this is right

Forecasting performance strongly correlates with perceptual quality across models.
Video-pretrained models consistently outperform image-based models at all levels of abstraction.
Language supervision does not consistently improve forecasting ability.
Video synthesis models match or exceed the forecasting performance of masking-based pretraining regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be used to screen new pretraining recipes for their effect on long-horizon prediction without full task retraining.
The observed correlation implies that models strong at static image synthesis may already encode useful temporal structure even when trained only on images.
If the isolation holds, the method offers a way to compare predictive power in multimodal models that include language or other modalities.

Load-bearing premise

That the diffusion training in representation space and the choice of lightweight readouts accurately isolate the backbone's own forecasting capacity without being dominated by the diffusion process itself or readout design.

What would settle it

If swapping the diffusion architecture or readout heads changes the performance ranking of the nine tested backbones, or if forecasting scores show no correlation with independent perceptual quality measures on the same models.
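One way to make the ranking test concrete is a Spearman rank correlation between backbone scores obtained under two forecaster designs. The sketch below is an illustration only: the scores are hypothetical placeholders, not results from the paper, and the helper assumes no tied scores.

```python
import numpy as np

def ranks(scores):
    """Rank scores from 0 (lowest) upward; assumes no ties."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    result = np.empty(len(scores), dtype=float)
    result[order] = np.arange(len(scores))
    return result

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical scores for five backbones under two forecaster designs.
diffusion_scores = [0.61, 0.48, 0.75, 0.52, 0.70]
linear_ar_scores = [0.40, 0.31, 0.55, 0.37, 0.49]
rho = spearman(diffusion_scores, linear_ar_scores)
print(rho)
```

A value near 1 would mean the ranking survives the swap; a value near 0 would suggest the ranking depends on the forecaster rather than on the backbone.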

Figures

Figures reproduced from arXiv: 2507.13942 by Carl Doersch, Guangyao Zhou, Jacob C Walker, João Carreira, Luisa Polania Cabrera, Maks Ovsjanikov, Pedro Vélez, Rishabh Kabra, Sayna Ebrahimi, Shiry Ginosar.

**Figure 1.** Forecasting performance strongly correlates with perceptual ability over short time horizons. While (a) perception is understanding current percepts, (b) forecasting predicts future states of the world. (c) We evaluate forecasting on pixels, point tracks, bounding box tracks, and depth using 10 samples per example and report the normalized max (min for lower-is-better metrics) performance per task. We comp…

**Figure 2.** Diffusion-based forecasting method from frozen vision model backbones. (a) Perception-style readouts: we train readout heads on frozen representations to perform downstream perception tasks like object detection on observed frames as in [4]. We extend this setup to forecasting as follows. (b) Forecasting framework: We introduce a forecasting diffusion model that predicts future representations conditioned …

**Figure 3.** Forecasting per-example metric results. We evaluate forecasting on pixels (PSNR), point tracks (Jaccard Distance), bounding box tracks (IoU), and depth maps (Mean Absolute Relative Error) using 10 samples per example. For reference, we also report perception performance on each task. Given the stochastic nature of forecasting, we report the mean and maximum/minimum performance across samples. This reveals …

**Figure 4.** Qualitative forecasts from the 4DS-e model across diverse tasks. We condition on frames 1–4 and forecast frames 5–16. Top (pixels): the model captures smooth camera motion. Middle (bounding boxes): it predicts a car turning (left) and vehicle motion (right). Bottom (point tracks): the model forecasts a hand rising (left) and camera motion (right). These results demonstrate that our approach generaliz…
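The captions for Figures 1 and 3 describe drawing 10 samples per example and reporting the mean together with the best sample (max for higher-is-better metrics, min for lower-is-better ones). A minimal sketch of that aggregation follows, with two of the named per-example metrics. The function names and the toy data are assumptions for illustration, and the exact normalization used in the paper is not reproduced here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher is better); undefined if pred == target."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def abs_rel(pred_depth, true_depth):
    """Mean absolute relative depth error (lower is better)."""
    return float(np.mean(np.abs(pred_depth - true_depth) / true_depth))

def aggregate_samples(scores, higher_is_better):
    """Summarize one example's sampled forecasts by mean and best-of-N."""
    scores = np.asarray(scores, dtype=float)
    best = scores.max() if higher_is_better else scores.min()
    return {"mean": float(scores.mean()), "best": float(best)}

# Ten hypothetical sampled forecasts of one toy 4x4 "frame".
rng = np.random.default_rng(0)
target = np.full((4, 4), 0.5)
samples = [target + rng.normal(0.0, 0.1, target.shape) for _ in range(10)]
psnr_scores = [psnr(s, target) for s in samples]
summary = aggregate_samples(psnr_scores, higher_is_better=True)
print(summary)
```

Reporting both numbers separates two questions: whether a typical sample is accurate (mean) and whether the true future is covered by at least one sample (best-of-N).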

Abstract

Forecasting future events is a fundamental capability for general-purpose systems that plan or act across different levels of abstraction. Yet, evaluating whether a forecast is "correct" remains challenging due to the inherent uncertainty of the future. We propose a unified evaluation framework for assessing the forecasting capabilities of frozen vision backbones across diverse tasks and abstraction levels. Rather than focusing on single time steps, our framework evaluates entire trajectories and incorporates distributional metrics that better capture the multimodal nature of future outcomes. Given a frozen vision model, we train latent diffusion models to forecast future features directly in its representation space, which are then decoded via lightweight, task-specific readouts. This enables consistent evaluation across a suite of diverse tasks while isolating the forecasting capacity of the backbone itself. We apply our framework to nine diverse vision models, spanning image and video pretraining, contrastive and generative objectives, and with or without language supervision, and evaluate them on four forecasting tasks, from low-level pixel predictions to high-level object motion. We find that forecasting performance strongly correlates with perceptual quality and that the forecasting abilities of video synthesis models are comparable or exceed those pretrained in masking regimes across all levels of abstraction. However, language supervision does not consistently improve forecasting. Notably, video-pretrained models consistently outperform image-based ones.
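The abstract does not name its distributional metrics. The reference list includes work on the Fréchet distance between Gaussians [9, 10] and on FID/FVD [15, 39], so the sketch below assumes a Fréchet-style distance between sets of forecast and ground-truth vectors. It is a generic NumPy illustration of that formula, not the paper's evaluation code, and the data are synthetic.

```python
import numpy as np

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to the rows of x and y.

    d^2 = ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2})
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((C_x C_y)^{1/2}) equals the sum of square roots of the eigenvalues
    # of C_x C_y, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))           # synthetic "ground-truth" vectors
shifted = real + np.array([3.0, 0.0, 0.0, 0.0])  # same spread, shifted mean
print(frechet_distance(real, real), frechet_distance(real, shifted))
```

Because the distance compares whole sample sets, it rewards forecasts whose spread matches the spread of real futures, which per-sample errors cannot capture.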

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper introduces a consistent diffusion-based test for forecasting in frozen vision backbones and reports that video-pretrained models outperform image-pretrained ones across tasks, though the isolation from forecaster effects remains unproven.

The main point is a framework that keeps a vision backbone frozen, trains a latent diffusion model to predict future features inside its representation space, and then applies lightweight task-specific decoders to turn those predictions into outputs. They run the same protocol on nine models and four tasks that range from pixel-level to object-motion forecasting, using distributional metrics over full trajectories instead of single-step errors. Video-pretrained models come out ahead at every abstraction level, and the scores line up with perceptual quality measures; language supervision does not show a clear benefit.

That setup lets them compare models that were trained under very different objectives without retraining the backbone each time, which is a practical step for anyone who needs to pick a representation for planning or simulation work. The distributional trajectory metrics also fit the multimodal nature of future states better than point estimates would.

The soft spot is the assumption that differences in performance reflect the backbone's own predictive structure rather than how compatible its features are with the diffusion training process. Video models often produce smoother temporal statistics or lower-dimensional features that could make the forecasting objective easier to optimize, independent of real future-state accuracy. The abstract gives no ablations that fix the diffusion architecture, noise schedule, and readout size while swapping only the backbone, nor does it test whether the same ranking appears with a simpler non-diffusion forecaster such as a linear autoregressive head. Without those checks the claim that the method isolates intrinsic forecasting capacity stays provisional. Statistical details on variance, data splits, or significance tests are also absent, which makes the reported correlations harder to weigh.

This kind of benchmark is aimed at researchers who evaluate or select vision models for predictive downstream use. The breadth of models and tasks gives it enough substance to go out for peer review, provided the authors add the missing controls and basic error reporting in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a unified evaluation framework called Frozen Forecasting for assessing the forecasting capabilities of frozen vision backbones across abstraction levels. Given a frozen model, latent diffusion models are trained to predict future features directly in its representation space; these forecasts are then decoded by lightweight task-specific readouts. The framework is applied to nine vision models (spanning image/video pretraining, contrastive/generative objectives, and language supervision) on four forecasting tasks ranging from low-level pixel prediction to high-level object motion. Main empirical claims are that forecasting performance correlates strongly with perceptual quality, video-pretrained models consistently outperform image-based ones, and language supervision does not reliably help.

Significance. If the method successfully isolates intrinsic forecasting capacity, the framework would offer a standardized, multi-task, multi-abstraction benchmark for comparing vision backbones, with direct implications for selecting models in predictive or planning systems. The reported correlation between forecasting and perceptual quality, together with the video-vs-image advantage, would be useful empirical guidance. The work is strengthened by its attempt at a consistent protocol across diverse models and tasks, but its significance is limited by the absence of controls needed to substantiate the isolation claim.

major comments (2)

[Method] The central claim that the latent-diffusion-plus-readout procedure isolates the backbone's intrinsic forecasting capacity is not supported by the required controls. No ablations are described that hold the diffusion architecture, noise schedule, number of steps, and readout capacity fixed while varying only the backbone. Without these, differences in reported forecasting scores could arise from how well each representation space aligns with the diffusion prior or from optimization ease rather than from genuine predictive structure in the frozen features.
[Experiments and Results] The comparisons across the nine models and the claimed correlation with perceptual quality are presented without error bars, statistical significance tests, or details on data splits and diffusion-model capacity controls. This makes it impossible to assess whether the consistent outperformance of video-pretrained models is robust or whether the ranking could change under a non-diffusion forecaster (e.g., linear autoregressive head).
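The non-diffusion control named in the second major comment could look like the following least-squares sketch of a linear autoregressive head in feature space. Everything here is an assumption for illustration: the first-order form z[t+1] ≈ z[t] A, the pooled fit, and the synthetic noise-free trajectories. It is deterministic, so unlike the diffusion forecaster it produces a single future per context.

```python
import numpy as np

def fit_linear_ar(trajectories):
    """Least-squares fit of A such that z[t+1] ≈ z[t] @ A, pooled over trajectories."""
    prev = np.concatenate([traj[:-1] for traj in trajectories])
    nxt = np.concatenate([traj[1:] for traj in trajectories])
    a, *_ = np.linalg.lstsq(prev, nxt, rcond=None)
    return a

def rollout(a, last_feat, horizon):
    """Apply the fitted map repeatedly to forecast `horizon` future features."""
    feats, z = [], last_feat
    for _ in range(horizon):
        z = z @ a
        feats.append(z)
    return np.stack(feats)

# Synthetic feature trajectories that follow exact linear dynamics.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
true_a = 0.95 * q
trajectories = []
for _ in range(20):
    z = rng.standard_normal(4)
    traj = [z]
    for _ in range(15):
        z = z @ true_a
        traj.append(z)
    trajectories.append(np.stack(traj))

a_hat = fit_linear_ar(trajectories)
forecast = rollout(a_hat, trajectories[0][3], horizon=12)  # condition on frame 4
print(forecast.shape)
```

Decoding `forecast` with the same frozen readouts and comparing the resulting backbone ranking with the diffusion-based ranking would address the robustness question directly.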

minor comments (2)

[Abstract / Method] The abstract and method description would benefit from an explicit table listing the four forecasting tasks, their abstraction levels, evaluation metrics, and datasets.
[Method] Notation for the representation space and the diffusion objective could be introduced earlier and used consistently to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating where revisions will be made to strengthen the paper.

Referee: [Method] The central claim that the latent-diffusion-plus-readout procedure isolates the backbone's intrinsic forecasting capacity is not supported by the required controls. No ablations are described that hold the diffusion architecture, noise schedule, number of steps, and readout capacity fixed while varying only the backbone. Without these, differences in reported forecasting scores could arise from how well each representation space aligns with the diffusion prior or from optimization ease rather than from genuine predictive structure in the frozen features.

Authors: We appreciate the referee's emphasis on rigorously isolating the backbone's forecasting capacity. Our framework employs a fixed latent diffusion model architecture, noise schedule, and number of diffusion steps across all backbones, with task-specific readouts kept lightweight and consistent in capacity. This design aims to attribute performance differences primarily to the quality of the frozen representations. Nevertheless, we acknowledge that additional ablations explicitly varying only the backbone while holding all other elements constant would provide stronger evidence. In the revised version, we will include such ablations on representative tasks to further substantiate the isolation claim.

revision: yes

Referee: [Experiments and Results] The comparisons across the nine models and the claimed correlation with perceptual quality are presented without error bars, statistical significance tests, or details on data splits and diffusion-model capacity controls. This makes it impossible to assess whether the consistent outperformance of video-pretrained models is robust or whether the ranking could change under a non-diffusion forecaster (e.g., linear autoregressive head).

Authors: We agree that including error bars, statistical significance tests, and explicit details on data splits and diffusion model capacities would enhance the reliability of our empirical findings. Regarding the potential change in ranking under alternative forecasters, our use of latent diffusion is intended to capture the multimodal nature of future predictions, which simpler models like linear autoregressive heads may not fully address. To address the referee's concern, we will add error bars and significance tests to the results in the revision. Additionally, we will include a comparison using a linear autoregressive readout on a subset of tasks to verify the robustness of the observed rankings.

revision: yes
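The error bars discussed in this exchange could be obtained in several ways. One simple option, sketched here as an assumption rather than as the authors' plan, is a percentile bootstrap over models for the correlation between perception and forecasting scores. The nine score pairs are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def bootstrap_correlation(x, y, num_resamples=2000, seed=0):
    """Pearson correlation with a 95% percentile-bootstrap interval over models."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    stats = []
    while len(stats) < num_resamples:
        idx = rng.integers(0, n, size=n)  # resample models with replacement
        xs, ys = x[idx], y[idx]
        if xs.max() == xs.min() or ys.max() == ys.min():
            continue  # correlation is undefined for a constant resample
        stats.append(np.corrcoef(xs, ys)[0, 1])
    low, high = np.percentile(stats, [2.5, 97.5])
    point = np.corrcoef(x, y)[0, 1]
    return float(point), float(low), float(high)

# Hypothetical perception and forecasting scores for nine backbones.
perception = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.58, 0.74, 0.51]
forecasting = [0.30, 0.41, 0.44, 0.39, 0.52, 0.47, 0.40, 0.57, 0.35]
point, low, high = bootstrap_correlation(perception, forecasting)
print(point, low, high)
```

With only nine models the interval will be wide, which is itself useful information when weighing a claimed strong correlation.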

Circularity Check

0 steps flagged

No circularity: evaluation framework derives performance metrics independently from new diffusion training on frozen representations

The paper defines a new evaluation protocol that freezes a vision backbone, trains a separate latent diffusion model to predict future features in that backbone's representation space, and then applies lightweight task-specific readouts to measure forecasting quality via distributional trajectory metrics. No equations or steps reduce the reported forecasting scores or correlations to quantities that are fitted or defined by the backbone itself; the diffusion training and readout performance are external to the backbone parameters. No self-citation is invoked as a uniqueness theorem or load-bearing justification for the central claims. The comparisons between video- and image-pretrained models, and the correlation with perceptual quality, follow directly from applying the same protocol across models rather than from any renaming, ansatz smuggling, or self-referential definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that feature-space diffusion forecasting plus lightweight decoders faithfully reflect the backbone's forecasting ability. No free parameters are explicitly fitted in the abstract description. No new entities are postulated.

axioms (1)

domain assumption: Forecasting future features in the representation space of a frozen vision backbone, decoded by task-specific readouts, isolates the backbone's forecasting capacity.
This premise is invoked when the framework is introduced to enable consistent evaluation across tasks while isolating the backbone.

pith-pipeline@v0.9.0 · 5796 in / 1330 out tokens · 35146 ms · 2026-05-19T03:38:55.043581+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
cs.CV 2026-04 conditional novelty 7.0

Delta tokens compress VFM feature differences into single tokens, enabling a lightweight generative world model that predicts diverse futures with far lower compute than existing approaches.

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
cs.CV 2026-04 unverdicted novelty 6.0

Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 2 Pith papers · 7 internal anchors

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. URL https://arxiv.org/abs/1607.06450
[2] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024
[3] Anand Bhattad, Daniel McKee, Derek Hoiem, and David Forsyth. Stylegan knows normal, depth, albedo, and more. Advances in Neural Information Processing Systems, 36:73082–73103, 2023
[4] João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, et al. Scaling 4d representations. arXiv preprint arXiv:2412.15212, 2024
[5] Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019
[6] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017
[7] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International conference on machine learning, pages 1174–1183. PMLR, 2018
[8] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 35:13610–13626, 2022
[9] DC Dowson and BV Landau. The fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455, 1982
[10] Maurice Fréchet. Sur la distance de deux lois de probabilité. In Annales de l’ISUP, volume 6, pages 183–198, 1957
[11] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Meh...
[12] Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, and Yang Gao. Seer: Language instructed video prediction with latent diffusion models. arXiv preprint arXiv:2303.14897, 2023
[13] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In ECCV, 2024
[14] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised semantic correspondence using stable diffusion. Advances in Neural Information Processing Systems, 36:8266–8279, 2023
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017
[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020
[17] Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022
[18] Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Dino-foresight: Looking into the future with dino, 2024. URL https://arxiv.org/abs/2412.11673
[19] Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023
[20] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217, 2023
[21] Chen Liu, Ke Xu, Liangbo L Shen, Guillaume Huguet, Zilong Wang, Alexander Tong, Danilo Bzdok, Jay Stewart, Jay C Wang, Lucian V Del Priore, et al. Imageflownet: Forecasting multiscale image-level trajectories of disease progression with irregularly-sampled longitudinal medical images. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech ...
[22] Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 648–657, 2017
[23] Pauline Luc, Camille Couprie, Yann Lecun, and Jakob Verbeek. Predicting future instance segmentation by forecasting convolutional features. In Proceedings of the european conference on computer vision (ECCV), pages 584–599, 2018
[24] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Advances in Neural Information Processing Systems, 36:47500–47510, 2023
[25] Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20395–20405, 2022
[26] OpenAI. Sora, 12 2024. URL https://openai.com/sora/
[27] Sergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Antonis Argyros. A review on deep learning techniques for video prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2806–2826. doi: 10.1109/TPAMI.2020.3045007
[29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
[30] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sing... Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du. URL https://arxiv.org/abs/2410.13720
[32] Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen,... Perception test: A diagnostic benchmark for multimodal video models
[33] Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, and Jitendra Malik. An empirical study of autoregressive pre-training from videos. arXiv preprint arXiv:2501.05453, 2025
[34] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014
[35] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset...
[36] Neerja Thakkar, Tara Sadjadpour, Jathushan Rajasegaran, Shiry Ginosar, and Jitendra Malik. Poly-autoregressive prediction for modeling interactions. In CVPR, 2025
[37] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. NeurIPS, 2022
[38] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018
[39] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. ICLR workshop, 2019
[40] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 98–106, 2016
[41]

Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, and Mehdi S

Pedro Vélez, Luisa F. Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, and Mehdi S. M. Sajjadi. From image to video: An empirical study of diffusion representations. arXiv preprint arXiv:2502.07001, 2025

work page arXiv 2025
[42]

VideoMAE v2: Scaling video masked autoencoders with dual masking

Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. VideoMAE v2: Scaling video masked autoencoders with dual masking. In CVPR, pages 14549–14560, 2023

work page 2023
[43]

Imaginator: Conditional spatio- temporal gan for video generation

Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. Imaginator: Conditional spatio- temporal gan for video generation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1160–1169, 2020

work page 2020
[44]

Aid: Adapting image2video diffusion models for instruction-guided video prediction

Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, and Yu-Gang Jiang. Aid: Adapting image2video diffusion models for instruction-guided video prediction. arXiv preprint arXiv:2406.06465, 2024

[45]

Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, and Chunhua Shen. What matters when repurposing diffusion models for general dense perception tasks? ICLR, 2025

[46]

Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang, and You He. Video diffusion models with local-global context guidance. arXiv preprint arXiv:2306.02562, 2023

[47]

Xi Ye and Guillaume-Alexandre Bilodeau. STDiff: Spatio-temporal diffusion for continuous stochastic video prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6666–6674, 2024

[48]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023

[49]

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable Diffusion complements DINO for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36:45533–45547, 2023

[50]

Xi Nicole Zhang, Yuan Pu, Yuki Kawamura, Andrew Loza, Yoshua Bengio, Dennis Shung, and Alexander Tong. Trajectory flow matching with applications to clinical time series modelling. Advances in Neural Information Processing Systems, 37:107198–107224, 2024

[51]

Long Zhao, Nitesh Bharadwaj Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, and Boqing Gong. VideoPrism: A foundational visual encoder for video understanding. In ICML, 2024

[52]

Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5729–5739, 2023

A Appendix

A.1 Ablations

Table 3: Diffusion vs Regression

Model Pixels Depth Point Tracks Box Tracks Mean ↑ Best ↑...
