pith. machine review for the scientific record.

arxiv: 2604.21931 · v1 · submitted 2026-04-23 · 💻 cs.CV · cs.AI · cs.GR

Recognition: unknown

Seeing Fast and Slow: Learning the Flow of Time in Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR
keywords temporal reasoning · self-supervised learning · video speed estimation · slow-motion dataset · speed-conditioned generation · temporal super-resolution
0 comments

The pith

Self-supervised models learn to perceive and control the flow of time in videos by detecting speed changes and estimating playback speeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats time as a learnable visual concept rather than a fixed property of video data. It builds self-supervised models that use multimodal cues and natural temporal structure to detect when videos have been sped up or slowed down and to estimate their playback speed. These models then allow curation of a very large slow-motion video collection from ordinary web sources. With that collection the authors train new models that generate video motion at a user-specified speed and that turn low-frame-rate blurry clips into high-frame-rate sequences with sharper temporal detail.
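To make the self-supervised step concrete, here is a minimal sketch of the kind of playback-speed pretext task the summary describes, in the spirit of the pace-prediction work the paper cites. The speed set, clip length, and encoder are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a playback-speed pretext task, not the authors' code:
# a clip is temporally resampled by a random factor and a small video encoder
# is trained to recover that factor, so no manual labels are needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEED_FACTORS = [0.25, 0.5, 1.0, 2.0, 4.0]  # assumed candidate playback speeds

def resample_clip(frames: torch.Tensor, factor: float, out_len: int = 16) -> torch.Tensor:
    """frames: (T, C, H, W). Select out_len frames as if played at `factor` speed."""
    idx = (torch.arange(out_len, dtype=torch.float32) * factor).clamp(max=frames.shape[0] - 1).long()
    return frames[idx]

class SpeedClassifier(nn.Module):
    """Placeholder encoder plus a linear head that predicts the applied speed factor."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone                  # any video encoder (3D CNN, ViT, ...)
        self.head = nn.Linear(feat_dim, len(SPEED_FACTORS))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(clip))     # clip layout is whatever the backbone expects

def pretext_loss(model: SpeedClassifier, frames: torch.Tensor) -> torch.Tensor:
    """Self-supervised objective: classify which resampling factor was applied."""
    label = torch.randint(len(SPEED_FACTORS), (1,))
    clip = resample_clip(frames, SPEED_FACTORS[label.item()]).unsqueeze(0)
    return F.cross_entropy(model(clip), label)
```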

Core claim

We study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details.

What carries the argument

self-supervised temporal reasoning models trained to detect speed changes and estimate playback speed from multimodal and temporal cues in videos

If this is right

  • A large high-quality slow-motion dataset becomes available for training without manual labeling.
  • Video generators can be conditioned on a target playback speed so the same scene can be rendered fast or slow on demand.
  • Low-frame-rate input can be turned into high-frame-rate output that recovers fine-grained motion details.
  • Temporal forensics tasks such as spotting speed tampering become feasible with the same learned speed detectors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same speed-estimation signal could be used to flag edited or synthetic video that contains inconsistent timing.
  • World models for robotics or simulation might improve if they explicitly represent how events unfold at different timescales.
  • Extending the approach to audio-visual alignment could let models learn consistent speed across sound and image streams.

Load-bearing premise

Multimodal cues and temporal structure inside ordinary videos are rich enough to let a model reliably spot speed changes and estimate playback speed even when the source videos are noisy and uncurated.

What would settle it

Train the same generation and super-resolution models on a size-matched random sample of ordinary web video instead of the curated slow-motion collection; if temporal coherence and fine detail show no measurable loss, the curated data carries little of the claimed benefit.
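One way to run that check, sketched as hypothetical code. The callables train_fn and eval_fvd and the datasets are placeholders rather than artifacts from the paper, and FVD is only one plausible temporal-quality metric.

```python
# Hypothetical ablation sketch: train the same downstream model on the curated
# slow-motion corpus and on a size-matched sample of ordinary video, then compare
# temporal quality on a shared test set. All callables and datasets are placeholders.
def curation_ablation(train_fn, eval_fvd, curated_slowmo, random_videos, test_set):
    model_curated = train_fn(curated_slowmo)
    model_random = train_fn(random_videos)
    return {
        "fvd_curated": eval_fvd(model_curated, test_set),
        "fvd_random": eval_fvd(model_random, test_set),
    }
```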

read the original abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.
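The abstract does not say how speed conditioning is implemented. One plausible design, sketched here purely as an assumption, is to embed the target playback speed and add it to whatever conditioning the video generator already uses.

```python
# Hypothetical speed-conditioning sketch (the abstract does not specify the
# mechanism): embed the target playback speed and hand it to the generator
# alongside its usual conditioning (text prompt, noise level, etc.).
import torch
import torch.nn as nn

class SpeedEmbedding(nn.Module):
    """Map a scalar playback speed (e.g. 0.25x to 16x) to a conditioning vector."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, speed: torch.Tensor) -> torch.Tensor:
        # log scale so that 0.5x and 2x sit symmetrically around normal speed (1x)
        return self.mlp(torch.log2(speed).unsqueeze(-1))

# Illustrative use: speed_vec = SpeedEmbedding()(torch.tensor([2.0]))  # condition on 2x playback
```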

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to treat time as a learnable visual concept in videos by developing self-supervised models that detect speed changes and estimate playback speed from multimodal and temporal cues. These models are then used to curate the largest slow-motion video dataset from noisy in-the-wild sources, which in turn supports new models for speed-conditioned video generation and temporal super-resolution.

Significance. If the self-supervised speed estimation step is reliable, the work would provide a valuable large-scale slow-motion dataset and demonstrate new capabilities for temporal control in video models, potentially advancing generative video methods, temporal forensics, and richer world models that reason about event timing. The self-supervised curation approach is a notable strength if validated.

major comments (1)
  1. [Abstract] The self-supervised pipeline for speed change detection and playback speed estimation is the load-bearing step for curating the claimed largest slow-motion dataset and all downstream results. The abstract describes the high-level approach but provides no quantitative validation, error analysis, ablations on real-world confounders (e.g., camera shake, cuts, audio desync), or comparisons to baselines, so the reliability of the labels cannot be assessed.
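For context on what the requested confounder analysis might involve, here is a hypothetical pre-filtering sketch, not something the abstract describes: clips are rejected when cuts, shake, or audio desync could mimic a speed change. The detector callables and thresholds are placeholders.

```python
# Hypothetical pre-filtering sketch, not described in the abstract: reject clips
# whose confounders (cuts, shake, audio desync) could masquerade as speed changes
# before running a learned speed estimator. All callables/thresholds are placeholders.
def is_clean_clip(clip, has_shot_cut, shake_score, av_offset_ms,
                  max_shake=0.3, max_desync_ms=80.0):
    if has_shot_cut(clip):                       # e.g. a TransNet-style cut detector
        return False
    if shake_score(clip) > max_shake:            # global-motion magnitude from optical flow
        return False
    if abs(av_offset_ms(clip)) > max_desync_ms:  # estimated audio-visual misalignment
        return False
    return True
```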

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address the major comment on the abstract below and have made revisions to improve clarity regarding the reliability of the self-supervised pipeline.

read point-by-point responses
  1. Referee: [Abstract] The self-supervised pipeline for speed change detection and playback speed estimation is the load-bearing step for curating the claimed largest slow-motion dataset and all downstream results. The abstract describes the high-level approach but provides no quantitative validation, error analysis, ablations on real-world confounders (e.g., camera shake, cuts, audio desync), or comparisons to baselines, so the reliability of the labels cannot be assessed.

    Authors: We agree that the abstract, in its original form, emphasizes the high-level approach without embedding specific quantitative results, which limits immediate assessment of the pipeline's reliability. The body of the manuscript contains the requested quantitative validation, including accuracy and error metrics for speed change detection and playback speed estimation, ablations addressing real-world factors such as camera motion and temporal discontinuities, and comparisons against baselines. To directly address this point, we have revised the abstract to incorporate key quantitative highlights from our experiments while preserving its concise nature. This change makes the load-bearing role of the self-supervised step more transparent to readers. revision: yes

Circularity Check

0 steps flagged

No significant circularity; self-supervised pipeline is data-driven and self-contained

full rationale

The paper presents a self-supervised learning pipeline that exploits naturally occurring multimodal and temporal cues in videos to train models for speed change detection and playback speed estimation. These models are then applied to curate a slow-motion dataset from in-the-wild sources, which in turn supports downstream tasks like speed-conditioned generation and temporal super-resolution. No equations, derivations, or self-citations appear that reduce any claimed prediction or result to its own inputs by construction. The approach relies on external data patterns rather than fitted parameters renamed as predictions or ansatzes smuggled via prior self-work. This is the standard case of an honest non-finding: the derivation chain does not collapse into tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that videos contain learnable cues for time flow and that self-supervised training plus model-based curation can produce usable slow-motion data. Standard deep learning training involves many hyperparameters, but these are not specified in the abstract.

free parameters (1)
  • neural network hyperparameters and training settings
    Typical in self-supervised video models; values are chosen or fitted during development but not detailed here.
axioms (1)
  • domain assumption: Videos contain sufficient multimodal and temporal structure to support self-supervised inference of playback speed changes.
    Invoked as the basis for the initial self-supervised learning stage described in the abstract.

pith-pipeline@v0.9.0 · 5556 in / 1287 out tokens · 61914 ms · 2026-05-09T21:52:11.592365+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

72 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558, 2025

    Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558, 2025

  2. [2]

    Test of time: Instilling video-language models with a sense of time

    Piyush Bagad, Makarand Tapaswi, and Cees GM Snoek. Test of time: Instilling video-language models with a sense of time. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2503–2516, 2023

  3. [3]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. InCVPR, 2025

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025

  5. [5]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InICCV, 2021

  6. [6]

    Speednet: Learning the speediness in videos

    Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. InCVPR, 2020

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv:2311.15127, 2023

  8. [8]

    Learning to synthesize motion blur

    Tim Brooks and Jonathan T Barron. Learning to synthesize motion blur. In CVPR, 2019

  9. [9]

    Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise

    Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. InCVPR, 2025

  10. [10]

    Sportsslomo: A new benchmark and baselines for human-centric video frame interpolation.arXiv:2308.16876, 2023

    Jiaben Chen and Huaizu Jiang. Sportsslomo: A new benchmark and baselines for human-centric video frame interpolation.arXiv:2308.16876, 2023

  11. [11]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. InCVPR, 2024

  12. [12]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv:2507.06261, 2025

  13. [13]

    Flolpips: A bespoke video quality metric for frame interpolation

    Duolikun Danier, Fan Zhang, and David Bull. Flolpips: A bespoke video quality metric for frame interpolation. InPicture Coding Symposium, 2022

  14. [14]

    Ldmvfi: Video frame interpolation with latent diffusion models

    Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. In AAAI, 2024

  15. [15]

    Do language models understand time? InCompanion Proceedings of the ACM on Web Conference 2025, pages 1855–1868, 2025

    Xi Ding and Lei Wang. Do language models understand time? InCompanion Proceedings of the ACM on Web Conference 2025, pages 1855–1868, 2025

  16. [16]

    Scvrl: Shuffled contrastive video representation learning

    Michael Dorkenwald, Fanyi Xiao, Biagio Brattoli, Joseph Tighe, and Davide Modolo. Scvrl: Shuffled contrastive video representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4132–4141, 2022

  17. [17]

    Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval

    Yang Du, Yuqi Liu, and Qin Jin. Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval. InProceedings of the 32nd ACM International Conference on Multimedia, pages 5260–5269, 2024

  18. [18]

    Explorative inbetweening of time and space

    Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J Black, and Xuaner Zhang. Explorative inbetweening of time and space. InECCV, 2024

  19. [19]

    The pulse of motion: Measuring physical frame rate from visual dynamics.arXiv preprint arXiv:2603.14375, 2026

    Xiangbo Gao, Mingyang Wu, Siyuan Yang, Jiongze Yu, Pardis Taghavi, Fangzhou Lin, and Zhengzhong Tu. The pulse of motion: Measuring physical frame rate from visual dynamics.arXiv preprint arXiv:2603.14375, 2026

  20. [20]

    Video time: Properties, encoders and evaluation.arXiv preprint arXiv:1807.06980, 2018

    Amir Ghodrati, Efstratios Gavves, and Cees GM Snoek. Video time: Properties, encoders and evaluation.arXiv preprint arXiv:1807.06980, 2018

  21. [21]

    Cover: A comprehensive video quality evaluator

    Chenlong He, Qi Zheng, Ruoxi Zhu, Xiaoyang Zeng, Yibo Fan, and Zhengzhong Tu. Cover: A comprehensive video quality evaluator. InCVPR W, 2024

  22. [22]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

  23. [23]

    arXiv preprint arXiv:2512.25075 (2025)

    Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y Wang, Joan Lasenby, and Chun-Hao Huang. Spacetimepilot: Generative rendering of dynamic scenes across space and time. arXiv:2512.25075, 2025

  24. [24]

    Real-time intermediate flow estimation for video frame interpolation

    Zhewei Huang, Tianyuan Zhang, Wen Heng, Boxin Shi, and Shuchang Zhou. Real-time intermediate flow estimation for video frame interpolation. InECCV, 2022

  25. [25]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, 2024

  26. [26]

    Video interpolation with diffusion models

    Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. InCVPR, 2024

  27. [27]

    Super slomo: High quality estimation of multiple intermediate frames for video interpolation

    Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. InCVPR, 2018

  28. [28]

    Vace: All-in-one video creation and editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025

  29. [29]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Yuri Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. InICCV, 2025

  30. [30]

    Need for speed: A benchmark for higher frame rate object tracking

    Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. InICCV, 2017

  31. [31]

    Unsupervised representation learning by sorting sequences

    Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. InProceedings of the IEEE international conference on computer vision, pages 667–676, 2017

  32. [32]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv:2305.06355, 2023

  33. [33]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InEMNLP, 2024

  34. [34]

    Video frame synthesis using deep voxel flow

    Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. InICCV, 2017

  35. [35]

    Beyond the frame: Generating 360°panoramic videos from perspective videos

    Rundong Luo, Matthew Wallingford, Ali Farhadi, Noah Snavely, and Wei-Chiu Ma. Beyond the frame: Generating 360°panoramic videos from perspective videos. InICCV, 2025

  36. [36]

    Uvg dataset: 50/120fps 4k sequences for video codec analysis and development

    Alexandre Mercat, Marko Viitanen, and Jarno Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. InProceedings of the 11th ACM multimedia systems conference, pages 297–302, 2020

  37. [37]

    Shuffle and learn: unsupervised learning using temporal order verification

    Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. InEuropean conference on computer vision, pages 527–544. Springer, 2016

  38. [38]

    Deep multi-scale convolutional neural network for dynamic scene deblurring

    Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InCVPR, 2017

  39. [39]

    OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv:2407.02371, 2024

  40. [40]

    Video frame interpolation via adaptive convolution

    Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. InCVPR, 2017

  41. [41]

    A benchmark dataset and evaluation methodology for video object segmentation

    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016

  42. [42]

    Seeing the arrow of time

    Lyndsey C Pickup, Zheng Pan, Donglai Wei, YiChang Shih, Changshui Zhang, Andrew Zisserman, Bernhard Scholkopf, and William T Freeman. Seeing the arrow of time. InCVPR, 2014

  43. [43]

    Film: Frame interpolation for large motion

    Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. InECCV, 2022

  44. [44]

    Xvfi: extreme video frame interpolation

    Hyeonjun Sim, Jihyong Oh, and Munchurl Kim. Xvfi: extreme video frame interpolation. InICCV, 2021

  45. [45]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In CVPR, 2024

  46. [46]

    Transnet v2: An effective deep network architecture for fast shot transition detection

    Tomás Soucek and Jakub Lokoc. Transnet v2: An effective deep network architecture for fast shot transition detection. InACM MM, 2024

  47. [47]

    Time and video speed perception: a comprehensive investigation of the relation between estimated video speed, clip duration and original duration

    Verena Steinhof, Anna Schroeger, Roman Liepelt, and Laura Sperl. Time and video speed perception: a comprehensive investigation of the relation between estimated video speed, clip duration and original duration. Cognitive Research: Principles and Implications, 2025

  48. [48]

    Deep video deblurring for hand-held cameras

    Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InCVPR, 2017

  49. [49]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. InICLR W, 2019

  50. [51]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv:2503.20314, 2025

  51. [52]

    arXiv preprint arXiv:2505.22944 (2025)

    Angtian Wang, Haibin Huang, Zhiyuan Fang, Yiding Yang, and Chongyang Ma. ATI: Any trajectory instruction for controllable video generation.arXiv:2505.22944, 2025

  52. [53]

    Self-supervised video representation learning by pace prediction

    Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu. Self-supervised video representation learning by pace prediction. InECCV, 2020

  53. [54]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. InCVPR, 2023

  54. [55]

    Generative inbetweening: Adapting image-to-video models for keyframe interpolation

    Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, and Steven M Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. InICLR, 2024

  55. [56]

    Internvid: A large-scale video-text dataset for multimodal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023

  56. [57]

    Sea-raft: Simple, efficient, accurate raft for optical flow

    Yihan Wang, Lahav Lipson, and Jia Deng. Sea-raft: Simple, efficient, accurate raft for optical flow. InECCV, 2024

  57. [58]

    Bullettime: Decoupled control of time and camera pose for video generation

    Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, and Gordon Wetzstein. Bullettime: Decoupled control of time and camera pose for video generation. arXiv:2512.05076, 2025

  58. [59]

    Paxion: Patching action knowledge in video-language foundation models.Advances in Neural Information Processing Systems, 36:20729–20749, 2023

    Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models.Advances in Neural Information Processing Systems, 36:20729–20749, 2023

  59. [60]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. InSIGGRAPH, 2024

  60. [61]

    Learning and using the arrow of time

    Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and using the arrow of time. InCVPR, 2018

  61. [62]

    Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution

    Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P Allebach, and Chenliang Xu. Zooming slow-mo: Fast and accurate one-stage space-time video super-resolution. InCVPR, 2020

  62. [63]

    Seeing the arrow of time in large multimodal models.arXiv preprint arXiv:2506.03340, 2025

    Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models.arXiv preprint arXiv:2506.03340, 2025

  63. [64]

    Video playback rate perception for self-supervised spatio-temporal representation learning

    Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, and Qixiang Ye. Video playback rate perception for self-supervised spatio-temporal representation learning. InCVPR, 2020

  64. [65]

    Dptext-detr: Towards better scene text detection with dynamic points in transformer

    Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. Dptext-detr: Towards better scene text detection with dynamic points in transformer. InAAAI, 2023

  65. [66]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  66. [67]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv:2503.21755, 2025

  67. [68]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479, 2025

    If you cannot determine a reasonable speed, click Cannot Tell (use sparingly) Important Notes • Please make sure you're using Chrome • Your answer can exceed these limits - enter any value you believe is correct Example Scenarios • If the video looks normal at 4× on the slider → Enter "4" in the input box • If the video appears 2× slow when slider is at 1...