pith. machine review for the scientific record.

arXiv: 2605.05895 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: unknown

Detecting AI-Generated Videos with Spiking Neural Networks

Changick Kim, Heeseon Kim, Minseok Son, Minsuk Jang, Younghun Kim, Yujin Yang

Pith reviewed 2026-05-09 16:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords AI-generated video detection · spiking neural networks · deepfake detection · temporal residuals · cross-generator evaluation · semantic trajectories · video forensics

The pith

Spiking neural networks detect AI-generated videos by responding to temporal smoothness gaps at object boundaries, achieving 93.14% mean accuracy across 10 unseen generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AI-generated videos differ from real ones in two ways: smoother frame-to-frame temporal residuals at the pixel level and more compact trajectories in semantic feature space. It shows that spiking neural networks naturally respond to these artifacts, firing predominantly at object and motion boundaries in fake clips. This matters because existing detectors built on standard neural networks degrade under cross-generator tests, where artifact types vary. The authors build a detector called MAST that combines both cues for better generalization. If correct, it opens a path to more robust and efficient detection methods as video generators improve.
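
To make the first signature concrete, here is a minimal sketch of the kind of measurement involved: frame-to-frame residuals and a Hoyer-style sparsity score (the sparseness measure from reference [28]). Whether the paper uses exactly this normalization, and the clip size, residual channel, and array names below, are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import laplace


def temporal_residuals(frames: np.ndarray) -> np.ndarray:
    """Absolute frame-to-frame differences for a clip of shape (T, H, W)."""
    return np.abs(np.diff(frames.astype(np.float32), axis=0))


def hoyer_sparsity(x: np.ndarray, eps: float = 1e-8) -> float:
    """Hoyer sparseness: close to 1 when energy sits on a few pixels, close to 0 when spread uniformly."""
    r = np.abs(x).ravel()
    n = r.size
    l1 = r.sum() + eps
    l2 = np.sqrt((r ** 2).sum()) + eps
    return float((np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0))


# Illustrative clip; the per-generator figures compare distributions of such scores.
rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(8, 64, 64)).astype(np.float32)
residual = temporal_residuals(clip)[0]   # one raw residual map
hf = laplace(residual)                   # a Laplacian high-frequency channel of that map
print(hoyer_sparsity(residual), hoyer_sparsity(hf))
```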

Core claim

We identify that AI-generated videos exhibit smoother temporal residuals and more compact semantic trajectories, and that SNNs fed raw video fire predominantly at object and motion boundaries for fake clips but not for real ones. Based on these observations, we propose MAST, which processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors.

What carries the argument

The MAST detector, which processes multi-channel temporal residuals using a spike-driven temporal branch alongside a frozen semantic encoder to capture the temporal smoothness gap in AI-generated videos.
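
As a rough mental model of that two-branch layout (not the paper's MAST implementation: layer counts, residual-channel definitions, and the semantic encoder are unspecified here, and every dimension and module name below is an assumed placeholder), a leaky integrate-and-fire branch over per-frame residual features can be fused with a frozen clip embedding as follows.

```python
import torch
import torch.nn as nn


class LIF(nn.Module):
    """Leaky integrate-and-fire layer with fixed time constant and threshold (hard reset)."""
    def __init__(self, tau: float = 2.0, v_th: float = 1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (T, B, D) input currents
        v = torch.zeros_like(x[0])
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau                  # leaky membrane integration
            s = (v >= self.v_th).float()                   # spike where the threshold is crossed
            v = v * (1.0 - s)                              # reset neurons that spiked
            spikes.append(s)
        # Note: the hard threshold passes no gradient; real SNN training uses
        # surrogate gradients (cf. [49] in the references), omitted in this sketch.
        return torch.stack(spikes)                         # binary spikes, shape (T, B, D)


class ResidualSpikeDetector(nn.Module):
    """Sketch of a trainable spiking branch over residual features fused with a
    frozen semantic clip embedding (assumed computed upstream by a frozen encoder)."""
    def __init__(self, residual_dim: int = 256, semantic_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.proj = nn.Linear(residual_dim, hidden)        # trainable temporal-branch input
        self.lif = LIF()
        self.head = nn.Linear(hidden + semantic_dim, 2)    # real vs. generated

    def forward(self, residual_feats: torch.Tensor, semantic_feat: torch.Tensor) -> torch.Tensor:
        # residual_feats: (T, B, residual_dim); semantic_feat: (B, semantic_dim)
        spikes = self.lif(self.proj(residual_feats))       # (T, B, hidden)
        rate = spikes.mean(dim=0)                          # spike-rate summary per clip
        return self.head(torch.cat([rate, semantic_feat], dim=-1))


# Dummy usage: 8 residual frames, batch of 4 clips.
model = ResidualSpikeDetector()
logits = model(torch.randn(8, 4, 256), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```

The ablation the referee asks for below would swap the LIF layer for a dense temporal module of matched capacity while keeping the residual inputs, the frozen embedding, and the head unchanged.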

If this is right

  • SNN-based detection matches or exceeds ANN performance in cross-generator settings for AI video identification.
  • The identified signatures of smoother residuals and compact trajectories serve as generalizable cues for detection.
  • Firing at object and motion boundaries in SNNs aligns with the sparse nature of temporal artifacts in fakes.
  • The approach demonstrates that event-driven dynamics suit the structure of residual signals better than dense backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could enable more energy-efficient detection systems since SNNs use sparse activation.
  • The cues might apply to detecting other forms of synthetic media with temporal inconsistencies.
  • Testing on future AI generators would verify if the signatures remain reliable beyond the current benchmark.

Load-bearing premise

The smoother temporal residuals and SNN firing at boundaries are consistent and generalizable indicators of AI-generated videos across all generators.

What would settle it

A new video generator whose fakes have temporal residuals as rough as those of real videos, or a cross-generator evaluation in which MAST's accuracy falls below that of ANN detectors, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.05895 by Changick Kim, Heeseon Kim, Minseok Son, Minsuk Jang, Younghun Kim, Yujin Yang.

Figure 1: (a) Motivation. Although real and AI-generated videos can appear similar at the frame …
Figure 2: GenVidBench Pair 1. (a) RGB sparsity overlaps; (b)(c)(d) Laplacian residual sparsity …
Figure 3: (a) Per-frame spatial mean ±σ of the temporal residual, real (blue) vs. fake (red, all generators pooled); a consistent gap persists across frames. (b) Raw-residual anomaly trace A^raw_t aggregated over the residual channels, per generator.
Figure 4: MAST overview. A trainable Spike-Driven Temporal Branch (SDTB, dashed) processes …
Figure 5: (a) Event-camera frames from DVS-Gesture [1]. (b) Pseudo-events derived from a conventional video. In both cases, responses are concentrated at pixels that change over time, supporting temporal residuals as event-like inputs to the temporal module.
Figure 6: Pixel-level temporal residuals derived from two consecutive frames …
Figure 7: Backbone-matched comparison on IvyFake. All methods share the X-CLIP-B/16 semantic backbone and differ only in the temporal module: D3 (RAFT), DeMamba (Mamba), Ours (SNN).
Figure 8: Training dynamics of learnable vs. fixed LIF. Fixed: τ=2.0, Vth=1.0.
Figure 9: Per-generator Laplacian HF sparsity distributions on IvyFake.
Figure 10: Per-generator Sobel sparsity distributions on IvyFake.
Figure 11: Per-generator absolute difference sparsity distributions on IvyFake.
Figure 12: Per-generator second-order difference sparsity distributions on IvyFake.
Figure 13: Per-generator chroma sparsity distributions on IvyFake. Unlike the luma-derived channels, …
Figure 14: Per-generator trajectory curvature distributions on IvyFake.
Figure 15: Per-generator spectral centroid fc distributions on IvyFake. Both statistics shift in the direction predicted by the GenVidBench Pair-1 analysis (Section 3.1). Trajectory curvature (θ) is consistently lower for generators than for the natural reference, indicating smoother and less-curved latent paths even on the more diverse IvyFake content. Spectral centroid fc is also consistently lower for generators …
Figure 16: t-SNE projection of MAST feature spaces on IvyFake. Top row: spiking temporal feature …
Figure 17: Boundary-vs-interior partition of the 14 × 14 patch grid. The outer 1-ring of 52 patches forms the boundary mask; the inner 12 × 12 = 144 patches form the interior mask.
Figure 18: Per-frame spike gate-map visualization on one representative clip per GenVideo test …
Figure 19: Per-frame spike gate-map visualization on one representative clip per GenVideo test …
Figure 20: Temporal residual channels on a fake clip generated by Runway Gen-2, a commercial …
Figure 21: Temporal residual channels on a fake clip generated by Show1, a hybrid pixel-and-latent …
The original abstract

Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On the caption-paired benchmark GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a spiking neural network (SNN), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely-activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MAST, a hybrid detector that feeds multi-channel temporal residuals into a spike-driven temporal branch (SNN) paired with a frozen semantic encoder. It identifies two signatures in AI-generated videos—smoother pixel-level frame-to-frame residuals and more compact semantic trajectories—and observes that SNNs produce boundary-localized firing on fakes but not reals. On the GenVideo benchmark the method reports 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or exceeding prior ANN detectors.

Significance. If the central performance claim is reproducible and the SNN contribution is isolated, the work would provide concrete evidence that event-driven spiking dynamics can exploit sparse temporal artifacts for cross-generator video detection, an area where dense ANN backbones have shown limited generalization.

major comments (3)
  1. [Abstract / Experimental results] The claim that the SNN temporal branch supplies the generalization advantage rests on the differential firing observation and the final accuracy number, yet no ablation is reported that replaces the SNN branch with an equivalent ANN temporal module (identical residual input channels, same training protocol, same semantic encoder). Without this control, it remains possible that the 93.14% figure is carried by the residual representation itself rather than by any property unique to spiking computation.
  2. [Methods / §4] The manuscript provides no quantitative statistics (firing-rate histograms, spatial localization metrics, or statistical tests) to support the qualitative statement that “fake clips elicit firing predominantly at object and motion boundaries.” This observation is load-bearing for the argument that SNN dynamics align with the structure of the residual signal.
  3. [Results] The reported mean accuracy of 93.14% across 10 generators is presented without per-generator breakdowns, confidence intervals, or details on the number of runs and statistical significance testing. Given the cross-generator setting, these omissions prevent assessment of whether the result is robust or driven by a subset of generators.
minor comments (2)
  1. [Abstract] The abstract refers to both “GenVidBench” and “GenVideo benchmark”; the relationship between these names should be clarified in the text.
  2. [Methods] The notation for the multi-channel temporal residual input and the precise architecture of the spike-driven branch (layer counts, neuron model, encoding scheme) are not fully specified in the provided sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment in detail below and will incorporate the suggested analyses to strengthen the claims regarding the SNN contribution and result robustness.

Point-by-point responses
  1. Referee: [Abstract / Experimental results] The claim that the SNN temporal branch supplies the generalization advantage rests on the differential firing observation and the final accuracy number, yet no ablation is reported that replaces the SNN branch with an equivalent ANN temporal module (identical residual input channels, same training protocol, same semantic encoder). Without this control, it remains possible that the 93.14% figure is carried by the residual representation itself rather than by any property unique to spiking computation.

    Authors: We agree that an ablation study is required to isolate the contribution of the spiking dynamics. In the revised manuscript we will replace the SNN temporal branch with an equivalent ANN module that receives identical multi-channel residual inputs, follows the same training protocol, and shares the frozen semantic encoder. We will report the resulting cross-generator accuracy and compare it directly to the SNN version to quantify any advantage attributable to spiking computation. revision: yes

  2. Referee: [Methods / §4] The manuscript provides no quantitative statistics (firing-rate histograms, spatial localization metrics, or statistical tests) to support the qualitative statement that “fake clips elicit firing predominantly at object and motion boundaries.” This observation is load-bearing for the argument that SNN dynamics align with the structure of the residual signal.

    Authors: We acknowledge that the qualitative firing observation requires quantitative backing. The revised manuscript will include firing-rate histograms for real versus generated clips, spatial localization metrics (e.g., the fraction of spikes occurring at object and motion boundaries), and statistical tests (e.g., Wilcoxon rank-sum) to demonstrate the significance of the boundary-localized firing pattern in fake videos; a minimal sketch of such a boundary-fraction test appears after these responses. revision: yes

  3. Referee: [Results] The reported mean accuracy of 93.14% across 10 generators is presented without per-generator breakdowns, confidence intervals, or details on the number of runs and statistical significance testing. Given the cross-generator setting, these omissions prevent assessment of whether the result is robust or driven by a subset of generators.

    Authors: We agree that detailed reporting is necessary for evaluating robustness in the cross-generator setting. The revised results section will provide per-generator accuracy tables, 95% confidence intervals, the number of independent runs performed, and p-values from appropriate statistical tests comparing MAST against baselines; a sketch of an interval estimate over per-generator accuracies also appears after these responses. revision: yes
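
A minimal sketch of the boundary-fraction test proposed in response 2, assuming the 14 × 14 boundary/interior partition described for Figure 17 (52 boundary patches, 144 interior patches); the spike-count maps, sample sizes, and Poisson rates below are placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import ranksums


def boundary_spike_fraction(spike_map: np.ndarray) -> float:
    """Fraction of spikes on the outer 1-ring of a 14x14 patch grid
    (52 boundary patches vs. the 12x12 = 144 interior patches of Figure 17)."""
    assert spike_map.shape == (14, 14)
    boundary = np.zeros((14, 14), dtype=bool)
    boundary[0, :] = boundary[-1, :] = boundary[:, 0] = boundary[:, -1] = True
    total = spike_map.sum()
    return float(spike_map[boundary].sum() / total) if total > 0 else 0.0


# Placeholder per-clip spike-count maps for real and fake clips.
rng = np.random.default_rng(1)
real_maps = rng.poisson(2.0, size=(20, 14, 14))
fake_maps = rng.poisson(2.0, size=(20, 14, 14))
fake_maps[:, [0, -1], :] += 4        # illustrative: fakes fire more on the boundary ring
fake_maps[:, :, [0, -1]] += 4

real_frac = [boundary_spike_fraction(m) for m in real_maps]
fake_frac = [boundary_spike_fraction(m) for m in fake_maps]
stat, p = ranksums(fake_frac, real_frac)
print(f"median boundary fraction: real={np.median(real_frac):.2f}, "
      f"fake={np.median(fake_frac):.2f}, rank-sum p={p:.2e}")
```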
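
For response 3, one way to attach uncertainty to the 10-generator mean is a percentile bootstrap over per-generator accuracies; the accuracy values below are invented placeholders, since only the 93.14% mean is reported.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for the mean of a small sample."""
    rng = np.random.default_rng(seed)
    v = np.asarray(values, dtype=float)
    means = np.array([rng.choice(v, size=v.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return v.mean(), (float(lo), float(hi))


# Placeholder per-generator accuracies for 10 held-out generators.
accs = [0.95, 0.91, 0.97, 0.89, 0.94, 0.92, 0.96, 0.90, 0.93, 0.94]
mean, (lo, hi) = bootstrap_mean_ci(accs)
print(f"mean accuracy = {mean:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

With only ten generators, the per-generator table itself is arguably more informative than the interval, which is why the referee asks for both.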

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no derivation chain

full rationale

The paper identifies empirical signatures (smoother temporal residuals, compact semantic trajectories, boundary-localized SNN firing) from observation on GenVidBench, proposes MAST as a hybrid architecture, and reports measured cross-generator accuracy (93.14%) on held-out generators. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes are present in the abstract or described structure. The central result is an external performance metric on unseen data, not a quantity forced by construction from the inputs or prior self-work. This is a standard empirical ML paper with independent evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical observations of video properties and SNN behavior rather than first-principles derivations; these are treated as general but are benchmark-derived assumptions.

axioms (2)
  • domain assumption AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level and more compact trajectories in the semantic feature space
    Presented as an observation from GenVidBench that prior detectors do not jointly exploit.
  • domain assumption When raw video is fed into SNNs, fake clips elicit firing predominantly at object and motion boundaries unlike real clips
    Empirical observation used to motivate the spike-driven branch.

pith-pipeline@v0.9.0 · 5617 in / 1302 out tokens · 57690 ms · 2026-05-09T16:11:24.180249+00:00 · methodology


Reference graph

Works this paper leans on

89 extracted references · 15 canonical work pages · 7 internal anchors

  1. [1]

    A low power, fully event-based gesture recognition system

    Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017

  2. [2]

    Ai-generated video detection via spatial-temporal anomaly learning

    Jianfa Bai, Man Lin, Gang Cao, and Zijie Lou. Ai-generated video detection via spatial-temporal anomaly learning. InChinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 460–470. Springer, 2024

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  4. [4]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024

  5. [5]

    The Laplacian pyramid as a compact image code

    P Burt and E Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983

  6. [6]

    The NEURON Book

    Nicholas T. Carnevale and Michael L. Hines.The NEURON Book. Cambridge University Press, 2006

  7. [7]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. Inproceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  8. [8]

    Recent event camera innovations: A survey

    Bharatesh Chakravarthi, Aayush Atul Verma, Kostas Daniilidis, Cornelia Fermuller, and Yezhou Yang. Recent event camera innovations: A survey. InEuropean conference on computer vision, pages 342–376. Springer, 2024

  9. [9]

    Videocrafter1: Open diffusion models for high-quality video generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023

  10. [10]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

  11. [11]

    Demamba: Ai-generated video detection on million-scale genvideo benchmark,

    Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detection on million-scale genvideo benchmark.arXiv preprint arXiv:2405.19707, 2024

  12. [12]

    Seine: Short-to-long video diffusion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. InThe Twelfth International Conference on Learning Representations, 2023

  13. [13]

    Unleashing temporal capacity of spiking neural networks through spatiotemporal separation

    Yiting Dong, Zhaofei Yu, Jianhao Ding, Zijie Xu, and Tiejun Huang. Unleashing temporal capacity of spiking neural networks through spatiotemporal separation. arXiv preprint arXiv:2512.05472, 2025

  14. [14]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 7346–7356, 2023

  15. [15]

    Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence

    Wei Fang, Yanqi Chen, Jianhao Ding, Zhaofei Yu, Timothée Masquelier, Ding Chen, Liwei Huang, Huihui Zhou, Guoqi Li, and Yonghong Tian. Spikingjelly: An open-source machine learning infrastructure platform for spike-based intelligence. Science Advances, 9(40):eadi1480, 2023

  16. [16]

    Incorporating learnable membrane time constant to enhance learning of spiking neural networks

    Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. InProceedings of the IEEE/CVF international conference on computer vision, pages 2661–2671, 2021

  17. [17]

    X3d: Expanding architectures for efficient video recognition

    Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 203–213, 2020

  18. [18]

    Slowfast networks for video recognition

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019

  19. [19]

    Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2020

    Guillermo Gallego, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Jörg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2020

  20. [20]

    Spiking neural networks.International journal of neural systems, 19(04):295–308, 2009

    Samanwoy Ghosh-Dastidar and Hojjat Adeli. Spiking neural networks.International journal of neural systems, 19(04):295–308, 2009

  21. [21]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  22. [22]

    Sharpening Your Density Fields: Spiking Neuron Aided Fast Geometry Learning

    Yi Gu, Zhaorui Wang, and Renjing Xu. Sharpening Your Density Fields: Spiking Neuron Aided Fast Geometry Learning. In Beibei Wang and Alexander Wilkie, editors,Eurographics Symposium on Rendering. The Eurographics Association, 2025

  23. [23]

    Spatiotemporal inconsistency learning for deepfake video detection

    Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, and Lizhuang Ma. Spatiotemporal inconsistency learning for deepfake video detection. InProceedings of the 29th ACM international conference on multimedia, pages 3473–3481, 2021

  24. [24]

    Enof-snn: Training accurate spiking neural networks via enhancing the output feature

    Yufei Guo, Weihang Peng, Xiaode Liu, Yuanpei Chen, Yuhan Zhang, Xin Tong, Zhou Jie, and Zhe Ma. Enof-snn: Training accurate spiking neural networks via enhancing the output feature. Advances in Neural Information Processing Systems, 37:51708–51726, 2024

  25. [25]

    Reverb-snn: Reversing bit of the weight and activation for spiking neural networks

    Yufei Guo, Yuhan Zhang, Zhou Jie, Xiaode Liu, Xin Tong, Yuanpei Chen, Weihang Peng, and Zhe Ma. Reverb-snn: Reversing bit of the weight and activation for spiking neural networks. In International Conference on Machine Learning, pages 21220–21231. PMLR, 2025

  26. [26]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

  27. [27]

    1.1 computing’s energy problem (and what we can do about it)

    Mark Horowitz. 1.1 computing’s energy problem (and what we can do about it). In2014 IEEE international solid-state circuits conference digest of technical papers (ISSCC), pages 10–14. IEEE, 2014

  28. [28]

    Non-negative matrix factorization with sparseness constraints.Journal of machine learning research, 5(Nov):1457–1469, 2004

    Patrik O Hoyer. Non-negative matrix factorization with sparseness constraints.Journal of machine learning research, 5(Nov):1457–1469, 2004

  29. [29]

    v2e: From video frames to realistic dvs events

    Yuhuang Hu, Shih-Chii Liu, and Tobi Delbruck. v2e: From video frames to realistic dvs events. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1312–1321, 2021

  30. [30]

    AI-generated video detection via perceptual straightening

    Cristian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, and David Klindt. AI-generated video detection via perceptual straightening. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  31. [31]

    Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

    Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, and Caifeng Shan. IVY-FAKE: A unified explainable framework and benchmark for image and video AIGC detection. arXiv preprint arXiv:2506.00979, 2025

  32. [32]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  33. [33]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

  34. [34]

    Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning.Advances in neural information processing systems, 33:18661–18673, 2020

  35. [35]

    Deepfake detection using the rate of change between frames based on computer vision.Sensors, 21(21):7367, 2021

    Gihun Lee and Mihui Kim. Deepfake detection using the rate of change between frames based on computer vision.Sensors, 21(21):7367, 2021

  36. [36]

    Cifar10-dvs: an event- stream dataset for object classification.Frontiers in neuroscience, 11:244131, 2017

    Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. Cifar10-dvs: an event- stream dataset for object classification.Frontiers in neuroscience, 11:244131, 2017

  37. [37]

    Videomamba: State space model for efficient video understanding

    Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State space model for efficient video understanding. In European conference on computer vision, pages 237–255. Springer, 2024

  38. [38]

    Mvitv2: Improved multiscale vision transformers for classification and detection

    Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4804–4814, 2022

  39. [39]

    Seenn: Towards temporal spiking early exit neural networks.Advances in Neural Information Processing Systems, 36:63327–63342, 2023

    Yuhang Li, Tamar Geller, Youngeun Kim, and Priyadarshini Panda. Seenn: Towards temporal spiking early exit neural networks.Advances in Neural Information Processing Systems, 36:63327–63342, 2023

  40. [40]

    Spik-NeRF: Bridging spiking neural networks and neural radiance fields for real-time rendering

    Yuxin Liao, Mingrui Wang, Hao Bian, Cheng Sun, and Bei Yu. Spik-NeRF: Bridging spiking neural networks and neural radiance fields for real-time rendering. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  41. [41]

    Tsm: Temporal shift module for efficient video understanding

    Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019

  42. [42]

    Video swin transformer

    Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022

  43. [43]

    Spiking neural networks for temporal processing: Status quo and future prospects.arXiv preprint arXiv:2502.09449, 2025

    Chenxiang Ma, Xinyi Chen, Yanchen Li, Qu Yang, Yujie Wu, Guoqi Li, Gang Pan, Huajin Tang, Kay Chen Tan, and Jibin Wu. Spiking neural networks for temporal processing: Status quo and future prospects.arXiv preprint arXiv:2502.09449, 2025

  44. [44]

    Detecting ai-generated video via frame consistency

    Long Ma, Zhiyuan Yan, Qinglang Guo, Yong Liao, Haiyang Yu, and Pengyuan Zhou. Detecting ai-generated video via frame consistency. In2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

  45. [45]

    Latte: Latent diffusion transformer for video generation.Transactions on Machine Learning Research, 2025

    Xin Ma, Yaohui Wang, Xinyuan Chen, Jia Gengyun, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.Transactions on Machine Learning Research, 2025

  46. [46]

    Networks of spiking neurons: The third generation of neural network models

    Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997

  47. [47]

    Moonvalley text to video generator

    Moonvalley.ai. Moonvalley text to video generator. https://moonvalley.ai/, 2022. Ac- cessed: 2026-05-04

  48. [48]

    Hotshot-xl

    John Mullan, Duncan Crawbuck, and Aakash Sastry. Hotshot-xl. https://github.com/hotshotco/hotshot-xl, 2023. Accessed: 2026-05-04

  49. [49]

    Surrogate gradient learning in spiking neural networks.IEEE Signal Processing Magazine, 36(6):51–63, 2019

    Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks.IEEE Signal Processing Magazine, 36(6):51–63, 2019

  50. [50]

    Expanding language-image pretrained models for general video recognition

    Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. InEuropean conference on computer vision, pages 1–18. Springer, 2022

  51. [51]

    Genvidbench: A 6-million benchmark for ai-generated video detection

    Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A 6-million benchmark for ai-generated video detection. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 15582– 15590, 2026

  52. [52]

    Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research Journal, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 2024

  53. [53]

    Enhancement digital forensic approach for inter-frame video forgery detection using a deep learning technique

    Mohammed R. Oraibi and Abdulkareem M. Radhi. Enhancement digital forensic approach for inter-frame video forgery detection using a deep learning technique.Iraqi Journal of Science, 63(6):2686–2701, 2022

  54. [54]

    Converting static image datasets to spiking neuromorphic datasets using saccades.Frontiers in neuroscience, 9:437, 2015

    Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades.Frontiers in neuroscience, 9:437, 2015

  55. [55]

    Pika ai video generator.https://pika.art, 2024

    Pika Labs. Pika ai video generator.https://pika.art, 2024. Accessed: 2026-05-04

  56. [56]

    Ucf-crime-dvs: A novel event-based dataset for video anomaly detection with spiking neural networks

    Yuanbin Qian, Shuhan Ye, Chong Wang, Xiaojie Cai, Jiangbo Qian, and Jiafei Wu. Ucf-crime-dvs: A novel event-based dataset for video anomaly detection with spiking neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6577–6585, 2025

  57. [57]

    Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization

    Nitin Rathi and Kaushik Roy. Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Transactions on Neural Networks and Learning Systems, 34(6):3174–3182, 2021

  58. [58]

    Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation

    Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. InInternational Conference on Learning Representations, 2020

  59. [59]

    Spiking pointnet: Spiking neural networks for point clouds.Advances in Neural Information Processing Systems, 36:41797–41808, 2023

    Dayong Ren, Zhe Ma, Yuanpei Chen, Weihang Peng, Xiaode Liu, Yuhan Zhang, and Yufei Guo. Spiking pointnet: Spiking neural networks for point clouds.Advances in Neural Information Processing Systems, 36:41797–41808, 2023

  60. [60]

    Ts-lif: A temporal segment spiking neuron network for time series forecasting

    FENG SHIBO, Wanjin Feng, Xingyu Gao, Peilin Zhao, and Zhiqi Shen. Ts-lif: A temporal segment spiking neuron network for time series forecasting. InThe Thirteenth International Conference on Learning Representations, 2025

  61. [61]

    On learning multi-modal forgery representation for diffusion generated video detection

    Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, and Xiaohong Liu. On learning multi-modal forgery representation for diffusion generated video detection. InAdvances in Neural Information Processing Systems, volume 37, pages 122054–122077, 2024

  62. [62]

    Morph studio text to video generator

    Morph studio. Morph studio text to video generator. https://www.morphstudio.com/. Accessed: 2026-05-04

  64. [64]

    Spiking neural networks for video analysis: An in-depth review of models and architectures.Neural Networks, page 108844, 2026

    SK Sudha and S Aji. Spiking neural networks for video analysis: An in-depth review of models and architectures.Neural Networks, page 108844, 2026

  65. [65]

    Rethinking the upsampling operations in CNN-based generative network for generalizable deepfake detection

    Chuangchuang Tan et al. Rethinking the upsampling operations in CNN-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28130–28139, 2024

  66. [66]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, pages 402–419. Springer, 2020

  67. [67]

    Lanesnns: Spiking neural networks for lane detection on the loihi neuromorphic processor

    Alberto Viale, Alessio Marchisio, Maurizio Martina, Guido Masera, and Muhammad Shafique. Lanesnns: Spiking neural networks for lane detection on the loihi neuromorphic processor. In2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 79–86. IEEE, 2022

  68. [68]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

  69. [69]

    Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks

    Lin Wang, Yo-Sung Ho, Kuk-Jin Yoon, et al. Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10081–10090, 2019

  70. [70]

    Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation

    Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation.arXiv preprint arXiv:2305.10874, 2023

  71. [71]

    Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, 133(5):3059–3078, 2025

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models.International Journal of Computer Vision, 133(5):3059–3078, 2025

  72. [72]

    Internvid: A large-scale video-text dataset for multi- modal understanding and generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multi- modal understanding and generation. InThe Twelfth International Conference on Learning Representations, 2024

  73. [73]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. InEuropean conference on computer vision, pages 396–416. Springer, 2024

  74. [74]

    Musev: Infinite-length and high fidelity virtual human video generation with visual conditioned parallel denoising.arxiv, 2024

    Zhiqiang Xia, Zhaokang Chen, Bin Wu, Chao Li, Kwok-Wai Hung, Chao Zhan, Yingjie He, and Wenjiang Zhou. Musev: Infinite-length and high fidelity virtual human video generation with visual conditioned parallel denoising.arxiv, 2024

  75. [75]

    Tall: Thumbnail layout for deepfake video detection

    Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, and Ran He. Tall: Thumbnail layout for deepfake video detection. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22658–22668, 2023

  76. [76]

    Vript: A video is worth thousands of words.Advances in Neural Information Processing Systems, 37:57240–57261, 2024

    Dongjie Yang, Suyuan Huang, Chengqiang Lu, Xiaodong Han, Haoxin Zhang, Yan Gao, Yao Hu, and Hai Zhao. Vript: A video is worth thousands of words.Advances in Neural Information Processing Systems, 37:57240–57261, 2024

  77. [77]

    Spike-driven transformer

    Man Yao, Jiakui Hu, Zhaokun Zhou, Li Yuan, Yonghong Tian, Bo Xu, and Guoqi Li. Spike-driven transformer. Advances in neural information processing systems, 36:64043–64058, 2023

  78. [78]

    Scaling spike-driven transformer with efficient spike firing approximation training.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2973–2990, 2025

    Man Yao, Xuerui Qiu, Tianxiang Hu, Jiakui Hu, Yuhong Chou, Keyu Tian, Jianxing Liao, Luziwei Leng, Bo Xu, and Guoqi Li. Scaling spike-driven transformer with efficient spike firing approximation training.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(4):2973–2990, 2025

  79. [79]

    Spikingnerf: Making bio-inspired neural networks see through the real world.arXiv preprint arXiv:2309.10987, 2023

    Xingting Yao, Qinghao Hu, Tielong Liu, Zitao Mo, Zeyu Zhu, Zhengyang Zhuge, and Jian Cheng. Spikingnerf: Making bio-inspired neural networks see through the real world.arXiv preprint arXiv:2309.10987, 2023

  80. [80]

    Loas: Fully temporal-parallel dataflow for dual-sparse spiking neural networks

    Ruokai Yin, Youngeun Kim, Di Wu, and Priyadarshini Panda. Loas: Fully temporal-parallel dataflow for dual-sparse spiking neural networks. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1107–1121. IEEE, 2024

Showing first 80 references.