arxiv: 2604.27975 · v1 · submitted 2026-04-30 · 💻 cs.CV · cs.AI

Recognition: unknown

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Ce Chen , Yi Ren , Yuanming Li , Viktor Goriachko , Zhenhui Ye , Zujin Guo , Zhibin Hong , Mingming Gong

Authors on Pith no claims yet

Pith reviewed 2026-05-07 07:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords shot transition detectionvision-language modelsoptical flow fusionsynthetic video datavideo segmentationtemporal dynamicsshot boundary detectionbenchmark dataset

0 comments

The pith

TransVLM detects shot transitions as continuous temporal segments by feeding a vision-language model concatenated color frames and optical flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that traditional shot boundary detection fails on complex transitions because it hunts for single cut points rather than the full intervals where one shot blends into another. To fix this, the authors redefine the problem as Shot Transition Detection and introduce TransVLM, a vision-language model that receives both color and motion information at the input stage. The motion prior comes from optical flow concatenated directly with color frames, so the language backbone sees temporal dynamics without extra tokens or architectural changes. A synthetic data engine generates balanced training videos to overcome the scarcity of real transition examples. If this works, video processing pipelines gain a more reliable way to locate and handle gradual edits that current methods routinely miss.

Core claim

The central claim is that formalizing Shot Transition Detection as the identification of continuous transition segments, combined with a simple concatenation of color and optical flow features passed to a standard vision-language model, produces superior detection of any transition type. This approach avoids the point-based limitation of prior shot boundary detection and removes the need for specialized spatiotemporal networks or additional visual tokens, while a scalable synthetic data engine addresses class imbalance in existing datasets.

What carries the argument

The central mechanism is the input-level feature fusion that concatenates color and optical flow representations before they reach the vision-language model, supplying motion context without increasing token count or altering the backbone.

If this is right

Redefining the task around full transition segments rather than isolated points enables detection of gradual blends that point-based methods corrupt or miss.
Concatenating optical flow at input stage improves temporal sensitivity in existing vision-language models without extra compute on the language backbone.
The synthetic data engine produces diverse transition examples that allow training despite severe class imbalance in public video datasets.
The resulting model outperforms both traditional heuristics and specialized video networks on a new STD benchmark while remaining deployable in production video pipelines.
Accurate continuous-segment detection supports downstream video tasks such as clean shot extraction for editing and summarization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same concatenation trick could be tested on other video-language tasks where motion matters more than static appearance, such as action localization.
If synthetic transition data generalizes well here, similar engines might help train models on rare temporal events in domains like surveillance or medical imaging.
A production system that already uses this model could feed its error cases back into the data engine to close the synthetic-to-real gap over time.
Treating transitions as intervals rather than points might improve metrics in video compression or streaming that rely on accurate shot boundaries.

Load-bearing premise

The load-bearing premise is that simply concatenating color and optical flow inputs is sufficient to give a standard vision-language model the temporal awareness needed for transition detection, and that transitions generated by the synthetic engine will match the complexity of real-world cases.

What would settle it

A set of real-world videos containing gradual or multi-stage transitions absent from the synthetic engine where TransVLM accuracy falls to the level of standard VLMs or heuristic methods would falsify the claim.

read the original abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reframes shot boundary detection as continuous segment detection and fuses optical flow into a VLM at input without extra tokens, but its superiority claims rest on synthetic training data whose match to real transitions is not demonstrated.

read the letter

The main contribution is shifting from point-based shot boundary detection to segment-based transition detection. This matches how gradual effects like fades and dissolves actually appear in video, and it avoids the ambiguity of pinning an exact cut frame. They adapt a VLM by concatenating color and optical flow representations at the input stage, which adds motion awareness without increasing token count on the language side. They also built a synthetic data engine to address class imbalance in existing datasets and released a benchmark for the new task. The production deployment note is a practical signal that someone has run it on real pipelines.

Referee Report

3 major / 3 minor

Summary. The paper formalizes Shot Transition Detection (STD) as the task of identifying continuous temporal segments containing transitions, in contrast to traditional Shot Boundary Detection (SBD) which targets isolated cut points. It introduces TransVLM, a VLM framework that injects optical flow as a motion prior via direct concatenation of color and flow representations at the input stage, avoiding extra tokens or architectural modifications to the language backbone. A scalable synthetic data engine is proposed to generate diverse transition videos and mitigate class imbalance in public datasets, paired with a new comprehensive STD benchmark. Experiments claim that TransVLM outperforms heuristic SBD methods, specialized spatiotemporal networks, and state-of-the-art VLMs, with the system already deployed in production.

Significance. If the performance claims and generalization hold, the work could meaningfully advance practical video editing and production pipelines by enabling more reliable handling of complex, continuous transitions. The input-level fusion strategy offers an efficient way to add temporal awareness to existing VLMs without token overhead, and the synthetic data engine addresses a real data scarcity issue that affects many video understanding tasks. The new benchmark has the potential to standardize evaluation for STD. Production deployment provides evidence of real-world utility, though the overall significance hinges on whether the synthetic training distribution transfers to diverse real-world cases.

major comments (3)

[§4] §4 (Experiments and Benchmark): The central claim of superior overall performance and generalization to 'any' shot transitions rests on the synthetic data engine producing representative examples. However, the manuscript provides no quantitative comparison (e.g., histograms or statistical tests) of transition speed, type distribution, visual artifacts, or co-occurrence with camera motion between synthetic and real videos. This is load-bearing; without it, reported gains on the benchmark could reflect distribution matching rather than a general solution.
[§3.2] §3.2 (Synthetic Data Engine): The description of the engine is high-level ('diverse transition videos'). The paper should include an ablation on synthesis parameters (e.g., transition duration, blending functions, camera motion injection) and report performance on a held-out real-only test set that was never seen during synthetic training. Absence of this test leaves open the possibility that gains are benchmark-specific.
[§3.1] §3.1 (TransVLM Architecture): While the concatenation of color and optical flow is presented as sufficient to inject temporal dynamics, the manuscript lacks an ablation comparing this simple fusion against alternatives (e.g., cross-attention fusion, additional temporal tokens, or flow as a separate stream). Without such controls, it is unclear whether the claimed efficiency and performance gains are due to the fusion strategy or other factors.

minor comments (3)

[Abstract / §1] The abstract and introduction repeatedly use 'any shot transitions' without a precise definition of the transition taxonomy or edge cases (e.g., gradual dissolves vs. complex effects with overlaid text). Adding a clear taxonomy table would improve clarity.
[§4] Figure captions and experimental tables should explicitly state the exact metrics (F1, precision/recall per transition type) and list all baselines with their original paper citations for reproducibility.
[§3.1] The optical flow computation method (e.g., which algorithm and parameters) and any preprocessing/normalization steps before concatenation should be detailed in §3.1 to allow exact replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments highlight important aspects for strengthening the claims on generalization and the contributions of the proposed components. We address each major comment below and will revise the manuscript to incorporate additional analyses and ablations as outlined.

read point-by-point responses

Referee: [§4] §4 (Experiments and Benchmark): The central claim of superior overall performance and generalization to 'any' shot transitions rests on the synthetic data engine producing representative examples. However, the manuscript provides no quantitative comparison (e.g., histograms or statistical tests) of transition speed, type distribution, visual artifacts, or co-occurrence with camera motion between synthetic and real videos. This is load-bearing; without it, reported gains on the benchmark could reflect distribution matching rather than a general solution.

Authors: We agree that explicit quantitative validation of the synthetic data distribution is necessary to support the generalization claims. In the revised manuscript, we will add histograms and statistical comparisons (including measures such as mean/variance differences and distribution similarity tests) for transition speed, type distribution, visual artifacts, and co-occurrence with camera motion between the generated synthetic videos and the real videos from the benchmark. These additions will clarify that performance improvements reflect a general solution rather than distribution matching. revision: yes
Referee: [§3.2] §3.2 (Synthetic Data Engine): The description of the engine is high-level ('diverse transition videos'). The paper should include an ablation on synthesis parameters (e.g., transition duration, blending functions, camera motion injection) and report performance on a held-out real-only test set that was never seen during synthetic training. Absence of this test leaves open the possibility that gains are benchmark-specific.

Authors: We will expand the description in Section 3.2 to provide more implementation details on the synthetic data engine. We will also add an ablation study evaluating the impact of key synthesis parameters, including transition duration, blending functions, and camera motion injection, on final model performance. For the held-out real-only test set, we will partition the real videos in the benchmark such that a subset is completely excluded from any training or validation procedures (synthetic data is used only for training) and report TransVLM performance on this held-out real subset to demonstrate generalization. revision: yes
Referee: [§3.1] §3.1 (TransVLM Architecture): While the concatenation of color and optical flow is presented as sufficient to inject temporal dynamics, the manuscript lacks an ablation comparing this simple fusion against alternatives (e.g., cross-attention fusion, additional temporal tokens, or flow as a separate stream). Without such controls, it is unclear whether the claimed efficiency and performance gains are due to the fusion strategy or other factors.

Authors: We will include a new ablation study comparing the input-level concatenation fusion against the suggested alternatives (cross-attention fusion, additional temporal tokens, and a separate flow stream). The ablation will report both accuracy metrics and computational overhead (e.g., token count and inference time) to demonstrate that the simple concatenation provides the claimed efficiency and performance benefits without architectural modifications to the language backbone. This will be added to the experiments or architecture section in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no definitional reductions or self-referential derivations

full rationale

The paper formalizes the STD task as detecting continuous transition segments rather than isolated cuts, then describes TransVLM as a VLM that receives concatenated color and optical-flow frames at the input stage via a simple fusion strategy. It further introduces a synthetic data engine to address class imbalance and reports superior performance via experiments against heuristics, spatiotemporal networks, and other VLMs. No equations, parameter-fitting steps, or derivations appear that would reduce the performance claims to tautological constructions (e.g., no fitted quantities renamed as predictions, no uniqueness theorems imported from self-citations, and no ansatz smuggled via prior work). The synthetic engine and benchmark are presented as independent engineering contributions whose effectiveness is asserted through external validation rather than by construction from the model inputs themselves. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. Optical flow is treated as a standard external prior rather than a new entity.

axioms (1)

domain assumption Standard VLMs can be made temporally aware by early fusion of motion features without architectural modification or extra tokens.
Implicit in the feature-fusion strategy described in the abstract.

pith-pipeline@v0.9.0 · 5577 in / 1247 out tokens · 83349 ms · 2026-05-07T07:17:13.233788+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 11 canonical work pages · 4 internal anchors

[1]

Methods and challenges in shot boundary detection: a review.Entropy, 20(4): 214, 2018

Sadiq H Abdulhussain, Abd Rahman Ramli, M Iqbal Saripan, Basheera M Mahmmod, Syed Abdul Rahman Al-Haddad, and Wissam A Jassim. Methods and challenges in shot boundary detection: a review.Entropy, 20(4): 214, 2018

2018
[2]

Qwen-vl: A versatile vision-language model for understanding, localization.Text Reading, and Beyond, 2(1):1, 2023

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization.Text Reading, and Beyond, 2(1):1, 2023

2023
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review arXiv 2025
[4]

Frozen in time: A joint video and image encoder for end-to-end retrieval

Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. InProceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021. 11 Research TransVLM: VLM for Shot Transition Detection

2021
[5]

A database and evaluation methodology for optical flow.International journal of computer vision, 92(1):1–31, 2011

Simon Baker, Daniel Scharstein, James P Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. A database and evaluation methodology for optical flow.International journal of computer vision, 92(1):1–31, 2011

2011
[6]

A deep siamese network for scene detection in broadcast videos

Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. A deep siamese network for scene detection in broadcast videos. InProceedings of the 23rd ACM international conference on Multimedia, pages 1199–1202, 2015

2015
[7]

Shot and scene detection via hierarchical clustering for re-using broadcast video

Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. Shot and scene detection via hierarchical clustering for re-using broadcast video. InInternational conference on computer analysis of images and patterns, pages 801–811. Springer, 2015

2015
[8]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review arXiv 2023
[9]

Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators.OpenAI Blog, 1(8):1, 2024

2024
[10]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

2017
[11]

Pyscenedetect: Video scene cut detection and analysis tool.Version 0.6

Brandon Castellano. Pyscenedetect: Video scene cut detection and analysis tool.Version 0.6. 0, 2014. URL https://github.com/Breakthrough/PySceneDetect

2014
[12]

Sora detector: A unified hallucination detection for large text-to-video models.arXiv preprint arXiv:2405.04180, 2024

Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, and Kui Ren. Sora detector: A unified hallucination detection for large text-to-video models.arXiv preprint arXiv:2405.04180, 2024

work page arXiv 2024
[13]

Slowfast networks for video recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019

2019
[14]

Video shot-boundary detection: issues, challenges and solutions.Artificial Intelligence Review, 57(4):104, 2024

Tejaswini Kar, Priyadarshi Kanungo, Sachi Nandan Mohanty, Sven Groppe, and Jinghua Groppe. Video shot-boundary detection: issues, challenges and solutions.Artificial Intelligence Review, 57(4):104, 2024

2024
[15]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

2023
[16]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 conference on empirical methods in natural language processing, pages 5971–5984, 2024

2024
[17]

Vista: A test-time self-improving video generation agent.arXiv preprint arXiv:2510.15831, 2025

Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, and Sercan Ö Arık. Vista: A test-time self-improving video generation agent.arXiv preprint arXiv:2510.15831, 2025

work page arXiv 2025
[18]

Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304, 2022

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning.Neurocomputing, 508:293–304, 2022

2022
[19]

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo, 2025

Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng, et al. Veomni: Scaling any modality model training with model-centric distributed recipe zoo. arXiv preprint arXiv:2508.02317, 2025

work page arXiv 2025
[20]

SportsShot: A fine-grained dataset for shot segmentation in multiple sports

Multimedia Computing Group, Nanjing University MCG-NJU. SportsShot: A fine-grained dataset for shot segmentation in multiple sports. https://codalab.lisn.upsaclay.fr/competitions/20982, 2024. Dataset hosted on CodaLab

2024
[21]

A local-to-global approach to multi-modal movie scene segmentation

Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A local-to-global approach to multi-modal movie scene segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10146–10155, 2020

2020
[22]

Actionatlas: A videoqa benchmark for domain-specialized action recognition.Advances in Neural Information Processing Systems, 37:137372–137402, 2024

Mohammadreza Reza Salehi, Jae Sung Park, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hanna Hajishirzi, and Ali Farhadi. Actionatlas: A videoqa benchmark for domain-specialized action recognition.Advances in Neural Information Processing Systems, 37:137372–137402, 2024

2024
[23]

Enhancing scene transition awareness in video generation via post-training

Hanwen Shen, Jiajie Lu, Yupeng Cao, and Xiaonan Yang. Enhancing scene transition awareness in video generation via post-training. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 706–721, 2025. 12 Research Trans...

2025
[24]

Transnet v2: An effective deep network architecture for fast shot transition detection

Lokoc Soucek, Tomás and Jakub. Transnet v2: An effective deep network architecture for fast shot transition detection. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11218–11221, 2024

2024
[25]

Transnet: A deep network for fast detection of common shot transitions.arXiv preprint arXiv:1906.03363, 2019

Tomáš Souček, Jaroslav Moravec, and Jakub Lokoč. Transnet: A deep network for fast detection of common shot transitions.arXiv preprint arXiv:1906.03363, 2019

work page arXiv 1906
[26]

Fast video shot transition localization with deep structured models

Shitao Tang, Litong Feng, Zhanghui Kuang, Yimin Chen, and Wei Zhang. Fast video shot transition localization with deep structured models. InAsian Conference on Computer Vision, pages 577–592. Springer, 2018

2018
[27]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review arXiv 2023
[28]

Converting video formats with ffmpeg.Linux journal, 2006(146):10, 2006

Suramya Tomar. Converting video formats with ffmpeg.Linux journal, 2006(146):10, 2006

2006
[29]

Cnn-based shot boundary detection and video annotation

Wenjing Tong, Li Song, Xiaokang Yang, Hui Qu, and Rong Xie. Cnn-based shot boundary detection and video annotation. In2015 IEEE international symposium on broadband multimedia systems and broadcasting, pages 1–5. IEEE, 2015

2015
[30]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

2022
[31]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review arXiv 2024
[32]

Internvideo: General video foundation models via generative and discriminative learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning.arXiv preprint arXiv:2212.03191, 2022

work page arXiv 2022
[33]

Two stage shot boundary detection via feature fusion and spatial-temporal convolutional neural networks.IEEE Access, 7:77268–77276, 2019

Lifang Wu, Shuai Zhang, Meng Jian, Zhe Lu, and Dong Wang. Two stage shot boundary detection via feature fusion and spatial-temporal convolutional neural networks.IEEE Access, 7:77268–77276, 2019

2019
[34]

Videoclip: Contrastive pre-training for zero-shot video-text understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 6787–6800, 2021

2021
[35]

Shot boundary detection using convolutional neural networks

Jingwei Xu, Li Song, and Rong Xie. Shot boundary detection using convolutional neural networks. In2016 Visual Communications and Image Processing (VCIP), pages 1–4. IEEE, 2016

2016
[36]

A feature-based algorithm for detecting and classifying scene breaks

Ramin Zabih, Justin Miller, and Kevin Mai. A feature-based algorithm for detecting and classifying scene breaks. InProceedings of the third ACM international conference on Multimedia, pages 189–200, 1995

1995
[37]

Waver: Wave your way to lifelike video generation,

Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Waver: Wave your way to lifelike video generation.arXiv preprint arXiv:2508.15761, 2025

work page arXiv 2025
[38]

Neuflow v2: High-efficiency optical flow estimation on edge devices.arXiv preprint arXiv:2408.10161, 6:12, 2024

Zhiyong Zhang, Aniket Gupta, Huaizu Jiang, and Hanumant Singh. Neuflow v2: High-efficiency optical flow estimation on edge devices.arXiv preprint arXiv:2408.10161, 6:12, 2024

work page arXiv 2024
[39]

Neuflow: Real-time, high-accuracy optical flow estimation on robots using edge devices

Zhiyong Zhang, Huaizu Jiang, and Hanumant Singh. Neuflow: Real-time, high-accuracy optical flow estimation on robots using edge devices. In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5048–5055. IEEE, 2024

2024
[40]

Neuflow-v2: Push high-efficiency optical flow to the limit

Zhiyong Zhang, Aniket Gupta, Huaizu Jiang, and Hanumant Singh. Neuflow-v2: Push high-efficiency optical flow to the limit. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2479–2485. IEEE, 2025

2025
[41]

Autoshot: A short video dataset and state-of-the-art shot boundary detection

Wentao Zhu, Yufang Huang, Xiufeng Xie, Wenxian Liu, Jincan Deng, Debing Zhang, Zhangyang Wang, and Ji Liu. Autoshot: A short video dataset and state-of-the-art shot boundary detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2238–2247, 2023. 13 Research TransVLM: VLM for Shot Transition Detection A Append...

2023
[42]

3.wipeleft: A linear spatial wipe transitioning across the frame from right to left

fade: A standard crossfade where the outgoing shot smoothly decreases in opacity while the incoming shot increases. 3.wipeleft: A linear spatial wipe transitioning across the frame from right to left. 4.wiperight: A linear spatial wipe transitioning from left to right. 5.wipeup: A vertical linear wipe progressing from the bottom edge to the top edge. 6.wi...
[43]

12.rectcrop: Reveals the incoming shot through a symmetrically expanding rectangular mask

circlecrop: Reveals the incoming shot through an expanding circular spatial mask originating from the center. 12.rectcrop: Reveals the incoming shot through a symmetrically expanding rectangular mask. 13.distance: Blends pixels temporally based on spatial distance calculations between the two frames
[44]

fadeblack: Fades the outgoing shot completely to a solid black frame before fading into the incoming shot
[45]

16.radial: A circular, clock-like rotational sweep mask revealing the incoming shot

fadewhite: Fades the outgoing shot completely to a solid white frame before fading into the incoming shot. 16.radial: A circular, clock-like rotational sweep mask revealing the incoming shot
[46]

18 Research TransVLM: VLM for Shot Transition Detection 18.smoothright: A fluid, smoothed sliding motion from left to right

smoothleft: A fluid, smoothed sliding motion of the incoming shot from right to left with easing dynamics. 18 Research TransVLM: VLM for Shot Transition Detection 18.smoothright: A fluid, smoothed sliding motion from left to right. 19.smoothup: A fluid, smoothed vertical sliding motion from bottom to top. 20.smoothdown: A fluid, smoothed vertical sliding ...
[47]

23.vertopen: Opens a vertical slit outward from the center horizontally to reveal the incoming shot

circleclose: Contracts a circular aperture to hide the outgoing shot, revealing the underlying incoming shot. 23.vertopen: Opens a vertical slit outward from the center horizontally to reveal the incoming shot. 24.vertclose: Closes a vertical slit inward horizontally to transition between shots. 25.horzopen: Opens a horizontal slit outward vertically from...
[48]

dissolve: A specialized pixel-level cross-dissolve that smooths the structural blending of overlapping frames
[49]

29.diagtl: A diagonal wipe originating strictly from the top-left corner

pixelize: Applies a highly blocky, pixelation filter that interpolates spatial frequencies between the two shots. 29.diagtl: A diagonal wipe originating strictly from the top-left corner. 30.diagtr: A diagonal wipe originating strictly from the top-right corner. 31.diagbl: A diagonal wipe originating strictly from the bottom-left corner. 32.diagbr: A diag...