pith. machine review for the scientific record.

arxiv: 2604.21931 · v1 · submitted 2026-04-23 · 💻 cs.CV · cs.AI · cs.GR

Recognition: unknown

Seeing Fast and Slow: Learning the Flow of Time in Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR
keywords temporal reasoning · self-supervised learning · video speed estimation · slow-motion dataset · speed-conditioned generation · temporal super-resolution
0 comments

The pith

Self-supervised models learn to perceive and control the flow of time in videos by detecting speed changes and estimating playback speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats time as a learnable visual concept rather than a fixed property of video data. It builds self-supervised models that use multimodal cues and natural temporal structure to detect when videos have been sped up or slowed down and to estimate their playback speed. These models then allow curation of a very large slow-motion video collection from ordinary web sources. With that collection the authors train new models that generate video motion at a user-specified speed and that turn low-frame-rate blurry clips into high-frame-rate sequences with sharper temporal detail.
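To make the self-supervised step concrete, here is a minimal sketch of the kind of playback-speed pretext task the summary describes, in the spirit of the pace-prediction work the paper cites. The speed set, clip length, and encoder are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a playback-speed pretext task, not the authors' code:
# a clip is temporally resampled by a random factor and a small video encoder
# is trained to recover that factor, so no manual labels are needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEED_FACTORS = [0.25, 0.5, 1.0, 2.0, 4.0]  # assumed candidate playback speeds

def resample_clip(frames: torch.Tensor, factor: float, out_len: int = 16) -> torch.Tensor:
    """frames: (T, C, H, W). Select out_len frames as if played at `factor` speed."""
    idx = (torch.arange(out_len, dtype=torch.float32) * factor).clamp(max=frames.shape[0] - 1).long()
    return frames[idx]

class SpeedClassifier(nn.Module):
    """Placeholder encoder plus a linear head that predicts the applied speed factor."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone                  # any video encoder (3D CNN, ViT, ...)
        self.head = nn.Linear(feat_dim, len(SPEED_FACTORS))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(clip))     # clip layout is whatever the backbone expects

def pretext_loss(model: SpeedClassifier, frames: torch.Tensor) -> torch.Tensor:
    """Self-supervised objective: classify which resampling factor was applied."""
    label = torch.randint(len(SPEED_FACTORS), (1,))
    clip = resample_clip(frames, SPEED_FACTORS[label.item()]).unsqueeze(0)
    return F.cross_entropy(model(clip), label)
```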

Core claim

We study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details.

What carries the argument

self-supervised temporal reasoning models trained to detect speed changes and estimate playback speed from multimodal and temporal cues in videos

If this is right

  • A large high-quality slow-motion dataset becomes available for training without manual labeling.
  • Video generators can be conditioned on a target playback speed so the same scene can be rendered fast or slow on demand.
  • Low-frame-rate input can be turned into high-frame-rate output that recovers fine-grained motion details.
  • Temporal forensics tasks such as spotting speed tampering become feasible with the same learned speed detectors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same speed-estimation signal could be used to flag edited or synthetic video that contains inconsistent timing.
  • World models for robotics or simulation might improve if they explicitly represent how events unfold at different timescales.
  • Extending the approach to audio-visual alignment could let models learn consistent speed across sound and image streams.

Load-bearing premise

Multimodal cues and temporal structure inside ordinary videos are rich enough to let a model reliably spot speed changes and estimate playback speed even when the source videos are noisy and uncurated.

What would settle it

Train the same generation and super-resolution models on a size-matched random sample of ordinary web video instead of the curated slow-motion collection; if temporal coherence and fine detail show no measurable loss, the curated data carries little of the claimed benefit.
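One way to run that check, sketched as hypothetical code. The callables train_fn and eval_fvd and the datasets are placeholders rather than artifacts from the paper, and FVD is only one plausible temporal-quality metric.

```python
# Hypothetical ablation sketch: train the same downstream model on the curated
# slow-motion corpus and on a size-matched sample of ordinary video, then compare
# temporal quality on a shared test set. All callables and datasets are placeholders.
def curation_ablation(train_fn, eval_fvd, curated_slowmo, random_videos, test_set):
    model_curated = train_fn(curated_slowmo)
    model_random = train_fn(random_videos)
    return {
        "fvd_curated": eval_fvd(model_curated, test_set),
        "fvd_random": eval_fvd(model_random, test_set),
    }
```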

read the original abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.
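The abstract does not say how speed conditioning is implemented. One plausible design, sketched here purely as an assumption, is to embed the target playback speed and add it to whatever conditioning the video generator already uses.

```python
# Hypothetical speed-conditioning sketch (the abstract does not specify the
# mechanism): embed the target playback speed and hand it to the generator
# alongside its usual conditioning (text prompt, noise level, etc.).
import torch
import torch.nn as nn

class SpeedEmbedding(nn.Module):
    """Map a scalar playback speed (e.g. 0.25x to 16x) to a conditioning vector."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # log scale so that 0.5x and 2x sit symmetrically around normal speed (1x)
        return self.mlp(torch.log2(speed).unsqueeze(-1))

# Illustrative use: speed_vec = SpeedEmbedding()(torch.tensor([2.0]))  # condition on 2x playback
```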

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to treat time as a learnable visual concept in videos by developing self-supervised models that detect speed changes and estimate playback speed from multimodal and temporal cues. These models are then used to curate the largest slow-motion video dataset from noisy in-the-wild sources, which in turn supports new models for speed-conditioned video generation and temporal super-resolution.

Significance. If the self-supervised speed estimation step is reliable, the work would provide a valuable large-scale slow-motion dataset and demonstrate new capabilities for temporal control in video models, potentially advancing generative video methods, temporal forensics, and richer world models that reason about event timing. The self-supervised curation approach is a notable strength if validated.

major comments (1)
  1. [Abstract] The self-supervised pipeline for speed change detection and playback speed estimation is the load-bearing step for curating the claimed largest slow-motion dataset and all downstream results. The abstract describes the high-level approach but provides no quantitative validation, error analysis, ablations on real-world confounders (e.g., camera shake, cuts, audio desync), or comparisons to baselines, so the reliability of the labels cannot be assessed.
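For context on what the requested confounder analysis might involve, here is a hypothetical pre-filtering sketch, not something the abstract describes: clips are rejected when cuts, shake, or audio desync could mimic a speed change. The detector callables and thresholds are placeholders.

```python
# Hypothetical pre-filtering sketch, not described in the abstract: reject clips
# whose confounders (cuts, shake, audio desync) could masquerade as speed changes
# before running a learned speed estimator. All callables/thresholds are placeholders.
def is_clean_clip(clip, has_shot_cut, shake_score, av_offset_ms,
                  max_shake=0.3, max_desync_ms=80.0):
    if has_shot_cut(clip):                       # e.g. a TransNet-style cut detector
        return False
    if shake_score(clip) > max_shake:            # global-motion magnitude from optical flow
        return False
    if abs(av_offset_ms(clip)) > max_desync_ms:  # estimated audio-visual misalignment
        return False
    return True
```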

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address the major comment on the abstract below and have made revisions to improve clarity regarding the reliability of the self-supervised pipeline.

read point-by-point responses
  1. Referee: [Abstract] The self-supervised pipeline for speed change detection and playback speed estimation is the load-bearing step for curating the claimed largest slow-motion dataset and all downstream results. The abstract describes the high-level approach but provides no quantitative validation, error analysis, ablations on real-world confounders (e.g., camera shake, cuts, audio desync), or comparisons to baselines, so the reliability of the labels cannot be assessed.

    Authors: We agree that the abstract, in its original form, emphasizes the high-level approach without embedding specific quantitative results, which limits immediate assessment of the pipeline's reliability. The body of the manuscript contains the requested quantitative validation, including accuracy and error metrics for speed change detection and playback speed estimation, ablations addressing real-world factors such as camera motion and temporal discontinuities, and comparisons against baselines. To directly address this point, we have revised the abstract to incorporate key quantitative highlights from our experiments while preserving its concise nature. This change makes the load-bearing role of the self-supervised step more transparent to readers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; self-supervised pipeline is data-driven and self-contained

full rationale

The paper presents a self-supervised learning pipeline that exploits naturally occurring multimodal and temporal cues in videos to train models for speed change detection and playback speed estimation. These models are then applied to curate a slow-motion dataset from in-the-wild sources, which in turn supports downstream tasks like speed-conditioned generation and temporal super-resolution. No equations, derivations, or self-citations appear that reduce any claimed prediction or result to its own inputs by construction. The approach relies on external data patterns rather than fitted parameters renamed as predictions or ansatzes smuggled via prior self-work. This is the standard case of an honest non-finding: the derivation chain does not collapse into tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that videos contain learnable cues for time flow and that self-supervised training plus model-based curation can produce usable slow-motion data. Standard deep learning training involves many hyperparameters, but these are not specified in the abstract.

free parameters (1)
  • neural network hyperparameters and training settings
    Typical in self-supervised video models; values are chosen or fitted during development but not detailed here.
axioms (1)
  • domain assumption: Videos contain sufficient multimodal and temporal structure to support self-supervised inference of playback speed changes.
    Invoked as the basis for the initial self-supervised learning stage described in the abstract.

pith-pipeline@v0.9.0 · 5556 in / 1287 out tokens · 61914 ms · 2026-05-09T21:52:11.592365+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

    Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  2. [2]

    Test of time: Instilling video-language models with a sense of time

    Piyush Bagad, Makarand Tapaswi, and Cees GM Snoek. Test of time: Instilling video-language models with a sense of time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2503–2516, 2023

  3. [3]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InCVPR, 2025

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025

  5. [5]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InICCV, 2021

  6. [6]

    Speednet: Learning the speediness in videos

    Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. InCVPR, 2020

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv:2311.15127, 2023

  8. [8]

    Learning to synthesize motion blur

    Tim Brooks and Jonathan T Barron. Learning to synthesize motion blur. In CVPR, 2019

  9. [9]

    Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise

    Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. InCVPR, 2025

  10. [10]

    Sportsslomo: A new benchmark and baselines for human-centric video frame interpolation.arXiv:2308.16876, 2023

    Jiaben Chen and Huaizu Jiang. Sportsslomo: A new benchmark and baselines for human-centric video frame interpolation.arXiv:2308.16876, 2023

  11. [11]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InCVPR, 2024

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv:2507.06261, 2025

  13. [13]

    Flolpips: A bespoke video quality metric for frame interpolation

    Duolikun Danier, Fan Zhang, and David Bull. Flolpips: A bespoke video quality metric for frame interpolation. InPicture Coding Symposium, 2022

  14. [14]

    Ldmvfi: Video frame interpolation with latent diffusion models

    Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. In AAAI, 2024

  15. [15]

    Do language models understand time? InCompanion Proceedings of the ACM on Web Conference 2025, pages 1855–1868, 2025

    Xi Ding and Lei Wang. Do language models understand time? InCompanion Proceedings of the ACM on Web Conference 2025, pages 1855–1868, 2025

  16. [16]

    Scvrl: Shuffled contrastive video representation learning

    Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, and Davide Modolo. Scvrl: Shuffled contrastive video representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4132–4141, 2022

  17. [17]

    Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval

    Yang Du, Yuqi Liu, and Qin Jin. Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval. InProceedings of the 32nd ACM International Conference on Multimedia, pages 5260–5269, 2024

  18. [18]

    Explorative inbetweening of time and space

    Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J Black, and Xuaner Zhang. Explorative inbetweening of time and space. InECCV, 2024

  19. [19]

    The pulse of motion: Measuring physical frame rate from visual dynamics.arXiv preprint arXiv:2603.14375, 2026

    Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi, Fangzhou Lin, and Zhengzhong Tu. The pulse of motion: Measuring physical frame rate from visual dynamics.arXiv preprint arXiv:2603.14375, 2026

  20. [20]

    Video time: Properties, encoders and evaluation.arXiv preprint arXiv:1807.06980, 2018

    Amir Ghodrati, Efstratios Gavves, and Cees GM Snoek. Video time: Properties, encoders and evaluation.arXiv preprint arXiv:1807.06980, 2018

  21. [21]

    Cover: A comprehensive video quality evaluator

    Chenlong He, Qi Zheng, Ruoxi Zhu, Xiaoyang Zeng, Yibo Fan, and Zhengzhong Tu. Cover: A comprehensive video quality evaluator. InCVPR W, 2024

  22. [22]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

  23. [23]

    arXiv preprint arXiv:2512.25075 (2025)

    Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y Wang, Joan Lasenby, and Chun-Hao Huang. Spacetimepilot: Generative rendering of dynamic scenes across space and time. arXiv:2512.25075, 2025

  24. [24]

    Real-time intermediate flow estimation for video frame interpolation

    Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. Real-time intermediate flow estimation for video frame interpolation. InECCV, 2022

  25. [25]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, 2024

  26. [26]

    Video interpolation with diffusion models

    Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. InCVPR, 2024

  27. [27]

    Super slomo: High quality estimation of multiple intermediate frames for video interpolation

    Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. InCVPR, 2018

  28. [28]

    Vace: All-in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025

  29. [29]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. InICCV, 2025

  30. [30]

    Need for speed: A benchmark for higher frame rate object tracking

    Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. InICCV, 2017

  31. [31]

    Unsupervised representation learning by sorting sequences

    Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. InProceedings of the IEEE international conference on computer vision, pages 667–676, 2017

  32. [32]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv:2305.06355, 2023

  33. [33]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InEMNLP, 2024

  34. [34]

    Video frame synthesis using deep voxel flow

    Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. InICCV, 2017

  35. [35]

    Beyond the frame: Generating 360°panoramic videos from perspective videos

    Rundong Luo, Matthew Wallingford, Ali Farhadi, Noah Snavely, and Wei-Chiu Ma. Beyond the frame: Generating 360°panoramic videos from perspective videos. InICCV, 2025

  36. [36]

    Uvg dataset: 50/120fps 4k sequences for video codec analysis and development

    Alexandre Mercat, Marko Viitanen, and Jarno Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. InProceedings of the 11th ACM multimedia systems conference, pages 297–302, 2020

  37. [37]

    Shuffle and learn: unsupervised learning using temporal order verification

    Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. InEuropean conference on computer vision, pages 527–544. Springer, 2016

  38. [38]

    Deep multi-scale convolutional neural network for dynamic scene deblurring

    Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InCVPR, 2017

  39. [39]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv:2407.02371, 2024

  40. [40]

    Video frame interpolation via adaptive convolution

    Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. InCVPR, 2017

  41. [41]

    A benchmark dataset and evaluation methodology for video object segmentation

    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016

  42. [42]

    Seeing the arrow of time

    Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. InCVPR, 2014

  43. [43]

    Film: Frame interpolation for large motion

    Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. InECCV, 2022

  44. [44]

    Xvfi: extreme video frame interpolation

    Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. Xvfi: extreme video frame interpolation. InICCV, 2021

  45. [45]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In CVPR, 2024

  46. [46]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomás Soucek and Jakub Lokoc. Transnet v2: An effective deep network architecture for fast shot transition detection. InACM MM, 2024

  47. [47]

    Time and video speed perception: a comprehensive investigation of the relation between estimated video speed, clip duration and original duration

    Verena Steinhof, Anna Schroeger, Roman Liepelt, and Laura Sperl. Time and video speed perception: a comprehensive investigation of the relation between estimated video speed, clip duration and original duration. Cognitive Research: Principles and Implications, 2025

  48. [48]

    Deep video deblurring for hand-held cameras

    Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InCVPR, 2017

  49. [49]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. InICLR W, 2019

  50. [51]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv:2503.20314, 2025

  51. [52]

    arXiv preprint arXiv:2505.22944 (2025)

    Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma. ATI: Any trajectory instruction for controllable video generation.arXiv:2505.22944, 2025

  52. [53]

    Self-supervised video representation learning by pace prediction

    Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. InECCV, 2020

  53. [54]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. InCVPR, 2023

  54. [55]

    Generative inbetweening: Adapting image-to-video models for keyframe interpolation

    Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. InICLR, 2024

  55. [56]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023

  56. [57]

    Sea-raft: Simple, efficient, accurate raft for optical flow

    Yihan Wang, Lahav Lipson, and Jia Deng. Sea-raft: Simple, efficient, accurate raft for optical flow. InECCV, 2024

  57. [58]

    Bullettime: Decoupled control of time and camera pose for video generation

    Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, and Gordon Wetzstein. Bullettime: Decoupled control of time and camera pose for video generation. arXiv:2512.05076, 2025

  58. [59]

    Paxion: Patching action knowledge in video-language foundation models.Advances in Neural Information Processing Systems, 36:20729–20749, 2023

    Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models.Advances in Neural Information Processing Systems, 36:20729–20749, 2023

  59. [60]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InSIGGRAPH, 2024

  60. [61]

    Learning and using the arrow of time

    Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. InCVPR, 2018

  61. [62]

    Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution

    Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. InCVPR, 2020

  62. [63]

    Seeing the arrow of time in large multimodal models.arXiv preprint arXiv:2506.03340, 2025

    Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models.arXiv preprint arXiv:2506.03340, 2025

  63. [64]

    Video playback rate perception for self-supervised spatio-temporal representation learning

    Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. InCVPR, 2020

  64. [65]

    Dptext-detr: Towards better scene text detection with dynamic points in transformer

    Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. Dptext-detr: Towards better scene text detection with dynamic points in transformer. InAAAI, 2023

  65. [66]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  66. [67]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv:2503.21755, 2025

  67. [68]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479, 2025

    If you cannot determine a reasonable speed, click Cannot Tell (use sparingly) Important Notes • Please make sure you're using Chrome • Your answer can exceed these limits - enter any value you believe is correct Example Scenarios • If the video looks normal at 4× on the slider → Enter "4" in the input box • If the video appears 2× slow when slider is at 1...