CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

Ali Etemad, Michael Greenspan, Nishq Poorav Desai

Pith reviewed 2026-05-10 08:32 UTC · model grok-4.3

classification 💻 cs.CV

keywords collidenet · video · forecasting · multi-scale · components · disentanglement · hierarchical · method

The pith

CollideNet achieves state-of-the-art time-to-collision forecasting on three public datasets by combining multi-scale spatial aggregation with temporal disentanglement of trend and seasonality in a hierarchical transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CollideNet, a new neural network design for predicting how soon a collision might happen based on video input. It handles video by looking at each frame at several different levels of detail at once, rather than just one resolution. For the time aspect, it separates the video signal into parts that change over time in different ways: the overall trend, repeating seasonal patterns, and irregular non-stationary changes. This separation is intended to help the model focus on the relevant motion cues for collision risk. The authors test the system on standard video datasets used for this task and report better accuracy than previous methods. They also check how well the model works when trained on one dataset and tested on another to see if it generalizes.
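To make the trend/seasonality split concrete, here is a minimal sketch of one common decomposition: a centered moving average gives the trend, and the residual is treated as the seasonal part. This is an illustrative assumption, not CollideNet's actual layer. The function name `decompose`, the `window` parameter, and the edge-replication padding are all hypothetical, and the input stands in for a single per-frame feature tracked over time.

```python
from typing import List, Tuple


def decompose(series: List[float], window: int = 5) -> Tuple[List[float], List[float]]:
    """Split a 1-D signal into a smooth trend and a residual (seasonal) part.

    Hypothetical helper for illustration only; not taken from the paper's code.
    """
    if not series:
        raise ValueError("series must not be empty")
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")

    half = window // 2
    # Replicate the end values so the moving average keeps the input length.
    padded = [series[0]] * half + list(series) + [series[-1]] * half

    # Centered moving average: one trend value per input sample.
    trend = [sum(padded[i:i + window]) / window for i in range(len(series))]

    # Whatever the trend does not explain is kept as the seasonal residual.
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


if __name__ == "__main__":
    signal = [0.0, 1.0, 0.5, 1.5, 1.0, 2.0, 1.5, 2.5]
    trend, seasonal = decompose(signal, window=3)
    print(trend)
    print(seasonal)
```

By construction the two parts sum back to the input, so the split loses no information. A larger `window` gives a smoother trend and leaves more of the signal in the residual.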

Core claim

Our method achieves state-of-the-art performance in comparison to prior works on three commonly used public datasets, setting a new state-of-the-art by a considerable margin.

Load-bearing premise

The assumption that the proposed hierarchical multi-scale spatial aggregation combined with temporal disentanglement of non-stationarity, trend, and seasonality will generalize beyond the three evaluated datasets and produce reliable TTC forecasts in unseen real-world conditions.

Figures

Figures reproduced from arXiv: 2604.16240 by Ali Etemad, Michael Greenspan, Nishq Poorav Desai.

**Figure 3.** The correlation score $c_j$ quantifies the relationship between $Q_i^{+}$ and $K_j$. The output $Y_i$ at the current segment $i$ is determined by using the similarity between $Q_{i-1}$ from the previous segment $i-1$ and the key $K_j$ as the weight for the value $V_{j+1}$ in the aggregation process.

Section 4 (Experiments), dataset and evaluation metrics: We use three publicly available datasets, namely DAD [9], CCD [6], and DoTA [63]. These datas…
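The aggregation described in the Figure 3 caption can be sketched as follows. This is a reading of the caption only, under stated assumptions: dot-product similarity, softmax normalization of the scores, and the choice to let j range over every segment that has a successor are not confirmed by the text shown here. The names `shifted_aggregate`, `dot`, and `softmax` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the (assumed) similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shifted_aggregate(queries: List[Vector], keys: List[Vector],
                      values: List[Vector], i: int) -> Vector:
    """Compute Y_i = sum_j softmax_j(<Q_{i-1}, K_j>) * V_{j+1}.

    The query comes from the previous segment (i - 1), and each weight is
    applied to the value one segment after the matched key (j + 1).
    """
    n = len(keys)
    if len(values) != n or n < 2:
        raise ValueError("need at least two segments with matching keys and values")
    if i < 1 or i >= len(queries):
        raise IndexError("segment i must have a predecessor")

    q_prev = queries[i - 1]
    # Only keys with a successor value are usable, hence range(n - 1).
    weights = softmax([dot(q_prev, keys[j]) for j in range(n - 1)])

    dim = len(values[0])
    return [sum(w * values[j + 1][d] for j, w in enumerate(weights))
            for d in range(dim)]


if __name__ == "__main__":
    q = [[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[0.0], [2.0], [4.0]]
    print(shifted_aggregate(q, k, v, i=1))
```

Pairing the weight for key j with the value at j + 1 makes the output a forecast of what followed similar past segments, which is one way to read the caption's "previous segment" wording.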

Original abstract

Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and comprehending both local and global patterns encapsulated in a video, both spatially and temporally. To address the multi-scale nature of video, we introduce a novel spatiotemporal hierarchical transformer-based architecture called CollideNet, specifically catered for effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method achieves state-of-the-art performance in comparison to prior works on three commonly used public datasets, setting a new state-of-the-art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentanglement of the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
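As a rough illustration of what looking at one frame "at multiple resolutions" can mean, the sketch below builds several average-pooled views of a single grayscale frame. It is a simplified stand-in and not the paper's spatial stream, which the abstract describes as a hierarchical transformer. The function names `average_pool` and `multi_scale_views` and the pooling factors are hypothetical choices.

```python
from typing import Dict, List, Sequence

Frame = List[List[float]]


def average_pool(frame: Frame, factor: int) -> Frame:
    """Downsample a frame by averaging non-overlapping factor x factor blocks."""
    height, width = len(frame), len(frame[0])
    if factor < 1 or height % factor or width % factor:
        raise ValueError("frame size must be divisible by the pooling factor")
    area = factor * factor
    return [
        [
            sum(frame[r + dr][c + dc]
                for dr in range(factor) for dc in range(factor)) / area
            for c in range(0, width, factor)
        ]
        for r in range(0, height, factor)
    ]


def multi_scale_views(frame: Frame, factors: Sequence[int] = (1, 2, 4)) -> Dict[int, Frame]:
    """Return one pooled view per factor, from fine (1) to coarse (largest)."""
    return {factor: average_pool(frame, factor) for factor in factors}


if __name__ == "__main__":
    # A 4 x 4 toy frame whose pixel values are 0..15 in row-major order.
    toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    for factor, view in multi_scale_views(toy).items():
        print(factor, view)
```

Fine views keep local detail such as a nearby vehicle's edges, while coarse views summarize the whole scene. A multi-scale model consumes several such views together rather than committing to one resolution.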

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes CollideNet, a hierarchical multi-scale spatiotemporal transformer architecture for time-to-collision (TTC) forecasting. The spatial stream aggregates multi-resolution information per video frame, while the temporal stream encodes multi-scale features and disentangles non-stationarity, trend, and seasonality components. The central claim is state-of-the-art performance on three public datasets, supported by cross-dataset evaluations, visualizations of the disentanglement, and public code release.

Significance. If the reported margins hold under the full experimental protocol, the work advances video representation learning for safety-critical applications such as collision avoidance. The combination of hierarchical spatial aggregation and explicit temporal disentanglement offers a structured approach to multi-scale spatiotemporal modeling, and the cross-dataset tests plus code release provide concrete support for generalization and reproducibility.

minor comments (2)

The abstract states results on three public datasets but does not name them; adding the dataset names would improve immediate readability.
Figure captions describing the disentanglement visualizations should explicitly reference the non-stationarity, trend, and seasonality components shown.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents CollideNet as an empirical neural architecture for TTC forecasting, with claims resting on experimental SOTA results across three public datasets plus cross-dataset generalization tests. No equations, derivations, or first-principles predictions appear in the provided text; the architecture (hierarchical spatial aggregation and temporal disentanglement of non-stationarity/trend/seasonality) is described as a design choice motivated by video properties rather than any self-referential definition or fitted input relabeled as output. The central performance margins are therefore not forced by construction and remain independently verifiable via the released code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond the standard assumption that video data contains separable spatial scales and temporal components amenable to transformer processing and decomposition.

Reference graph

Works this paper leans on

70 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Anderson, O.D., Kendall, M.G.: Time-Series. The Statistician 25, 308 (1976)

1976
[2]

Anjum, T., Chirade, L., Lin, B., Narayan, A.: Learning Spatio-Temporal Features via 3D CNNs to Forecast Time-to-Accident. In: International Conference on Agents and Artificial Intelligence. pp. 532–540 (2023)

2023
[3]

Anjum, T., Kumar, D., Narayan, A.: Spatio-temporal Analysis of Dashboard Camera Videos for Time-To-Accident Forecasting. In: IEEE International Joint Conference on Neural Networks. pp. 1–8 (2023)

2023
[4]

Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: A video vision transformer. In: IEEE/CVF International Conference on Computer Vision. pp. 6836–6846 (2021)

2021
[5]

Bajgoti, A., Gupta, R., B, P., Dwivedi, R., Siwach, M., Gupta, D.: SwinAnomaly: Real-Time Video Anomaly Detection Using Video Swin Transformer and SORT. IEEE Access 11, 111093–111105 (2023)

2023
[6]

Bao, W., Yu, Q., Kong, Y.: Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning. In: ACM Multimedia Conference (May 2020)

2020
[7]

Bertasius, G., Wang, H., Torresani, L.: Is Space-time Attention All You Need for Video Understanding? In: International Conference on Machine Learning (2021)

2021
[8]

Chakraborty, P., Sharma, A., Hegde, C.: Freeway Traffic Incident Detection from Cameras: A Semi-Supervised Learning Approach. International Conference on Intelligent Transportation Systems pp. 1840–1845 (2018)

2018
[9]

Chan, F.H., Chen, Y.T., Xiang, Y., Sun, M.: Anticipating Accidents in Dashcam Videos. In: Asian Conference on Computer Vision. pp. 136–153 (2016)

2016
[10]

Desai, N.P., Etemad, A., Greenspan, M.: Cyclecrash: A dataset of bicycle collision videos for collision prediction and analysis. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 6688–6698 (2025)

2025
[11]

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations (2021)

2021
[12]

Du, D., Su, B., Wei, Z.: Preformer: Predictive Transformer with Multi-scale Segment-wise Correlations for Long-term Time Series Forecasting. In: IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 1–5 (2023)

2023
[13]

Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale Vision Transformers. In: IEEE/CVF International Conference on Computer Vision. pp. 6824–6835 (2021)

2021
[14]

Feichtenhofer, C.: X3D: Expanding Architectures for Efficient Video Recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)

2020
[15]

Grosek, J., Kutz, J.N.: Dynamic Mode Decomposition for Real-Time Background/Foreground Separation in Video. arXiv preprint arXiv:1404.7592 (2014)

2014
[16]

Hara, K., Kataoka, H., Satoh, Y.: Learning Spatio-temporal Features with 3d Residual Networks for Action Recognition. In: IEEE International Conference on Computer Vision Workshops. pp. 3154–3160 (2017)

2017
[17]

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15979–15988 (2022)

2022
[18]

Janssen, W., Thomas, M.: Time-to-Collision and Collision Avoidance Systems. International Co-operation on Theories and Concepts in Traffic Safety Workshop (1994)

1994
[19]

Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, A., Suleyman, M., Zisserman, A.: The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950 (2017)

2017
[20]

Kutz, J.N., Fu, X., Brunton, S.L., Erichson, N.B.: Multi-resolution Dynamic Mode Decomposition for Foreground/Background Separation and Object Tracking. In: IEEE International Conference on Computer Vision Workshop. pp. 921–929 (2015)

2015
[21]

Kwiatkowski, D., Phillips, P.C., Schmidt, P., Shin, Y.: Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root: How Sure Are We That Economic Time Series Have a Unit Root? Journal of Econometrics 54(1), 159–178 (1992)

1992
[22]

Li, Y., Wu, C.Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C.: mViTv2: Improved Multiscale Vision Transformers for Classification and Detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4804–4814 (2022)

2022
[23]

Liu, Y., Wu, H., Wang, J., Long, M.: Non-stationary transformers: Exploring the Stationarity in Time Series Forecasting. Advances in Neural Information Processing Systems 35, 9881–9893 (2022)

2022
[24]

Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video Swin Transformer. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3202–3211 (2022)

2022
[25]

Luo, H., Wang, F.: A Simulation-Based Framework for Urban Traffic Accident Detection. IEEE International Conference on Acoustics, Speech and Signal Processing pp. 1–5 (2023)

2023
[26]

Luo, W., Liu, W., Lian, D., Gao, S.: Future Frame Prediction Network for Video Anomaly Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 7505–7520 (2021)

2021
[27]

Manglik, A., Weng, X., Ohn-Bar, E., Kitani, K.M.: Forecasting Time-to-Collision from Monocular Video: Feasibility, Dataset, and Challenges. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 8081–8088 (2019)

2019
[28]

Mozer, M.C.: Induction of Multiscale Temporal Structure. In: Neural Information Processing Systems (1991)

1991
[29]

Nagar, P., Shastry, A., Chaudhari, J., Arora, C.: SEMA: Semantic Attention for Capturing Long-Range Dependencies in Egocentric Lifelogs. IEEE/CVF Winter Conference on Applications of Computer Vision pp. 7010–7020 (2024)

2024
[30]

National Highway Traffic Safety Administration (NHTSA): Automated Vehicles for Safety (2024), https://www.nhtsa.gov/vehicle-safety/automated-vehicles-safety, Accessed: 2024-12-24

2024
[31]

National Transportation Safety Board: Special Investigation Report: Highway Vehicle and Infrastructure-based Technology for the Prevention of Rear-end Collisions (2001), http://www.ntsb.gov/Publictn/2001/SIR0101.pdf

2001
[32]

Nguyen, K.T., Dinh, D.T., Do, M.N., Tran, M.T.: Anomaly Detection in Traffic Surveillance Videos with GAN-based Future Frame Prediction. International Conference on Multimedia Retrieval (2020)

2020
[33]

Oreshkin, B.N., Carpov, D., Chapados, N., Bengio, Y.: N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting. arXiv preprint arXiv:1905.10437 (2019)

2019
[34]

Pendergrass, S., Brunton, S.L., Kutz, J.N., Erichson, N.B., Askham, T.: Dynamic Mode Decomposition for Background Modeling. In: IEEE International Conference on Computer Vision Workshops. pp. 1862–1870 (2017)

2017
[35]

von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Nair, D., Paul, S., Berman, W., Xu, Y., Liu, S., Wolf, T.: Diffusers: State-of-the-art Diffusion Models. https://github.com/huggingface/diffusers (2022)

2022
[36]

Qin, H., Zhou, D., Xu, T., Bian, Z., Li, J.: Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost. IEEE Transactions on Neural Networks and Learning Systems (2023)

2023
[37]

Qiu, Z., Yao, T., Ngo, C.W., Tian, X., Mei, T.: Learning Spatio-Temporal Representation With Local and Global Diffusion. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 12048–12057 (2019)

2019
[38]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models from Natural Language Supervision. In: International Conference on Machine Learning (2021)

2021
[39]

Cleveland, R.B., Cleveland, W.S., McRae, J.E., Terpenning, I.: STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics 6, 3–73 (1990)

1990
[40]

Redmon, J., Farhadi, A.: YOLOv3: An Incremental Improvement. arXiv preprint arXiv:1804.02767 (2018)

2018
[41]

Ryali, C., Hu, Y.T., Bolya, D., Wei, C., Fan, H., Huang, P.Y., Aggarwal, V., Chowdhury, A., Poursaeed, O., Hoffman, J., et al.: Hiera: A Hierarchical Vision Transformer Without the Bells-and-Whistles. In: International Conference on Machine Learning. pp. 29441–29454 (2023)

2023
[42]

Said, S.E., Dickey, D.A.: Testing for Unit Roots in Autoregressive-Moving Average Models of Unknown Order. Biometrika 71(3), 599–607 (1984)

1984
[43]

Santhosh, K.K., Dogra, D.P., Roy, P.P., Mitra, A.: Vehicular Trajectory Classification and Traffic Anomaly Detection in Videos Using a Hybrid CNN-VAE Architecture. IEEE Transactions on Intelligent Transportation Systems 23, 11891–11902 (2021)

2021
[44]

Shabani, M.A., Abdi, A.H., Meng, L., Sylvain, T.: Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. In: International Conference on Learning Representations (2023)

2023
[45]

Singh, D., Mohan, C.K.: Deep Spatio-Temporal Representation for Detection of Road Accidents Using Stacked Autoencoder. IEEE Transactions on Intelligent Transportation Systems 20, 879–887 (2019)

2019
[46]

Srinivasan, A., Srikanth, A., Indrajit, H., Narasimhan, V.: A Novel Approach for Road Accident Detection using DETR Algorithm. International Conference on Intelligent Data Science Technologies and Applications pp. 75–80 (2020)

2020
[47]

Suzuki, T., Kataoka, H., Aoki, Y., Satoh, Y.: Anticipating Traffic Accidents with Adaptive Loss and Large-Scale Incident DB. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 3521–3529 (2018)

2018
[48]

Taccari, L., Sambo, F., Bravi, L., Salti, S., Sarti, L., Simoncini, M., Lori, A.: Classification of Crash and Near-Crash Events from Dashcam Videos and Telematics. International Conference on Intelligent Transportation Systems pp. 2460–2465 (2018)

2018
[49]

Taylor, S.J., Letham, B.: Forecasting at Scale. The American Statistician 72(1), 37–45 (2018)

2018
[50]

Teoh, E.R.: Effectiveness of Front Crash Prevention Systems in Reducing Large Truck Real-World Crash Rates. Traffic Injury Prevention (March 2021), https://www.iihs.org/topics/bibliography/ref/2211, Insurance Institute for Highway Safety, Highway Loss Data Institute, ID: 2211

2021
[51]

Thakur, N., Gouripeddi, P., Li, B.: Graph(Graph): A Nested Graph-Based Framework for Early Accident Anticipation. In: IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 7533–7541 (2024)

2024
[52]

Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning Spatiotemporal Features with 3d Convolutional Networks. In: IEEE International Conference on Computer Vision. pp. 4489–4497 (2015)

2015
[53]

Vaswani, A.: Attention is All You Need. Advances in Neural Information Processing Systems (2017)

2017
[54]

Wang, H., Peng, J., Huang, F., Wang, J., Chen, J., Xiao, Y.: MICN: Multi-scale Local and Global Context Modeling for Long-term Series Forecasting. In: International Conference on Learning Representations (2023)

2023
[55]

Wang, S., Wu, H., Shi, X.L., Hu, T., Luo, H., Ma, L., Zhang, J.Y., Zhou, J.: TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. arXiv preprint arXiv:2405.14616 (2024)

2024
[56]

Wang, T., Kim, S., Ji, W., Xie, E., Ge, C., Chen, J., Li, Z., Luo, P.: DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving. In: Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (2023)

2023
[57]

Wang, X., Che, Z., Yang, K., Jiang, B., Tang, J.B., Ye, J., Wang, J., Qi, Q.: Robust Unsupervised Video Anomaly Detection by Multipath Frame Prediction. IEEE Transactions on Neural Networks and Learning Systems 33, 2301–2312 (2020)

2020
[58]

Wang, X., Liu, J., Qiu, T., Mu, C., Chen, C., Zhou, P.: A Real-Time Collision Prediction Mechanism With Deep Learning for Intelligent Transportation System. IEEE Transactions on Vehicular Technology 69, 9497–9508 (2020)

2020
[59]

Wasim, S.T., Khattak, M.U., Naseer, M., Khan, S., Shah, M., Khan, F.S.: Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition. In: IEEE/CVF International Conference on Computer Vision (2023)

2023
[60]

Wen, J., Xu, Y., Tang, J., Zhan, Y., Lai, Z., Guo, X.: Joint Video Frame Set Division and Low-Rank Decomposition for Background Subtraction. IEEE Transactions on Circuits and Systems for Video Technology 24(12), 2034–2048 (2014)

2014
[61]

Wu, H., Xu, J., Wang, J., Long, M.: Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In: Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems (2021)

2021
[62]

Wu, M.X., Chang, C.S., Miao, J.M., Lee, C.Y.: Predicting Car Accidents with YOLOv7 Object Detection and Object Relationships. IEEE International Conference on Multimedia and Expo Workshops pp. 87–89 (2023)

2023
[63]

Yao, Y., Wang, X., Xu, M., Pu, Z., Wang, Y., Atkins, E., Crandall, D.: DoTA: unsupervised detection of traffic anomaly in driving videos. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)

2022
[64]

Yi, C., Huang, T., Ye, H.J., chuan Zhan, D.: Improved Dynamic Spatial-Temporal Attention Network for Early Anticipation of Traffic Accidents. IEEE International Conference on Multimedia and Expo Workshops pp. 81–86 (2023)

2023
[65]

Yoo, J.H., Kim, S., Lee, D., Kim, C., Hong, S.: Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 22888–22897 (2023)

2023
[66]

You, T., Han, B.: Traffic Accident Benchmark for Causality Recognition. In: European Conference on Computer Vision (2020)

2020
[67]

Zeng, K.H., Chou, S.H., Chan, F.H., Niebles, J.C., Sun, M.: Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization. IEEE Conference on Computer Vision and Pattern Recognition pp. 1330–1338 (2017)

2017
[68]

Zhang, M., Wang, J., Qi, Q., Sun, H., Zhuang, Z., Ren, P., Ma, R., Liao, J.: Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17385–17394 (2024)

2024
[69]

Zhao, S., Jin, M., Hou, Z., Yang, C., Li, Z., Wen, Q., Wang, Y.: HiMTM: Hierarchical Multi-Scale Masked Time Series Modeling with Self-Distillation for Long-Term Forecasting. In: ACM International Conference on Information and Knowledge Management. pp. 3352–3362 (2024)

2024
[70]

Zhong, Y., Chen, X., Hu, Y., Tang, P., Ren, F.: Bidirectional Spatio-Temporal Feature Learning With Multiscale Evaluation for Video Anomaly Detection. IEEE Transactions on Circuits and Systems for Video Technology 32, 8285–8296 (2022)

2022