pith. machine review for the scientific record.

arxiv: 2604.13593 · v1 · submitted 2026-04-15 · 💻 cs.MM

Recognition: unknown

AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

Depeng Wang, Hao Lin, Huijia Zhu, Ke Xu, Li Luo, Tanfeng Sun, Xinghao Jiang, Ya Guo, Zixuan Chen

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:19 UTC · model grok-4.3

classification 💻 cs.MM
keywords audio-visual inconsistency · omni-modal benchmark · cross-modal conflicts · video understanding · temporal grounding · agent-driven construction · multimodal reasoning · inconsistency detection

The pith

AVID benchmark tests how well AI models detect and reason about audio-visual inconsistencies in long videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build the first large-scale benchmark that tests whether omni-modal models can perceive cross-modal conflicts, a skill humans use routinely but current systems largely lack. Existing tests emphasize aligned events or forgery detection and leave long-form video inconsistency evaluation unaddressed, which matters because reliable AI needs to flag when sound and image contradict each other. The authors construct AVID through temporal segmentation of videos into active-speaker, voiceover, and scenic types, followed by an agent planner that chooses inconsistency categories and five targeted injectors that insert the conflicts. The resulting collection of over eleven thousand videos supplies annotated events for four core tasks: detection, temporal grounding, classification, and reasoning. Fine-tuning a base model on this data produces measurable gains over both the untuned version and other models, positioning the benchmark as a practical driver for more trustworthy multimodal systems.
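To make the described construction recipe concrete, here is a minimal Python sketch of an AVID-style pipeline: segments tagged as active-speaker, voiceover, or scenic, a planner stand-in that picks a category compatible with each segment type, and injector stubs that only record the planned edit. The class names, category labels, and the 50% injection rate are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an AVID-style construction pipeline (names, category
# labels, and the injection rate are assumptions, not the authors' code).
import random
from dataclasses import dataclass
from typing import Dict, List

# The paper's eight categories are not spelled out here; these labels are
# placeholders grouped by the segment type they could plausibly apply to.
CATEGORIES_BY_SEGMENT: Dict[str, List[str]] = {
    "active_speaker": ["lip_speech_mismatch", "speaker_identity_swap"],
    "voiceover": ["narration_content_conflict", "emotion_mismatch"],
    "scenic": ["ambient_sound_swap", "event_sound_removal"],
}

@dataclass
class Segment:
    start: float  # seconds
    end: float
    kind: str     # "active_speaker", "voiceover", or "scenic"

@dataclass
class PlannedEvent:
    segment: Segment
    category: str

def plan_inconsistencies(segments: List[Segment], rng=random) -> List[PlannedEvent]:
    """Planner stand-in: pick a category compatible with each segment's type."""
    events = []
    for seg in segments:
        candidates = CATEGORIES_BY_SEGMENT.get(seg.kind, [])
        if candidates and rng.random() < 0.5:  # not every segment receives a conflict
            events.append(PlannedEvent(seg, rng.choice(candidates)))
    return events

def inject(video_path: str, events: List[PlannedEvent]) -> List[dict]:
    """Injector stub: a real injector would edit the audio/video tracks; here we
    only record what would change, which doubles as the event annotation."""
    return [{"video": video_path, "start": e.segment.start, "end": e.segment.end,
             "category": e.category} for e in events]

# Example: two segments of a hypothetical clip.
segments = [Segment(0.0, 45.0, "active_speaker"), Segment(45.0, 120.0, "scenic")]
print(inject("example.mp4", plan_inconsistencies(segments)))
```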

Core claim

AVID is generated by a pipeline that first segments long videos by content type, then uses an agent-based planner to select appropriate inconsistency categories, and finally applies five specialized injectors to create realistic audio-visual conflicts; the outcome is a dataset of 11.2K videos containing 39.4K annotated events across eight categories that exposes clear weaknesses in current omni-modal models on temporal grounding and reasoning while allowing fine-tuned models to achieve stronger results on the same tasks.
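Under that claim, each annotated event pairs a time span with one of the eight categories and a rationale, and a model is scored on four outputs. A minimal schema sketch, with field names assumed for illustration rather than taken from the released data format:

```python
# Assumed annotation and prediction schema for the four AVID tasks; field names
# are illustrative, not the paper's.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnnotatedEvent:
    video_id: str
    start: float       # seconds, onset of the injected conflict
    end: float         # seconds, offset
    category: str      # one of the 8 inconsistency categories
    rationale: str     # free-text explanation used by the reasoning task

@dataclass
class ModelPrediction:
    detected: bool                         # detection: is there any conflict?
    intervals: List[Tuple[float, float]]   # temporal grounding: predicted spans
    category: str                          # classification: which of the 8 categories
    explanation: str                       # reasoning: free text, scored e.g. with BLEU-4
```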

What carries the argument

The agent-driven construction pipeline that performs temporal segmentation, selects inconsistency types via a planner, and deploys five specialized injectors to insert audio-visual conflicts.

If this is right

  • The benchmark supplies standardized test cases for detection, temporal grounding, classification, and reasoning across eight inconsistency categories.
  • Current omni-modal models exhibit measurable shortcomings in temporal grounding and reasoning when evaluated on the dataset.
  • Fine-tuning on the AVID data yields stronger performance than the base model or other compared systems on the supported tasks.
  • AVID functions as a reusable testbed for developing omni-modal systems that handle cross-modal conflicts more reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent-planner-plus-injector approach could be reused to synthesize training data that improves model robustness on naturally occurring video mismatches in editing or surveillance settings.
  • AVID-style benchmarks may help researchers test whether stronger inconsistency awareness also improves performance on aligned multimodal tasks such as captioning or retrieval.
  • Developers of content-moderation or media-authenticity tools could adopt the dataset to measure how well systems flag real-world audio-visual contradictions.

Load-bearing premise

The pipeline's agent planner and five injectors produce inconsistency events that are semantically fitting, realistic, and typical of conflicts humans notice in videos.

What would settle it

A controlled human study in which raters judge a random sample of AVID events as unnatural or non-conflicting, or a controlled experiment showing that models fine-tuned on AVID gain no advantage on a separate collection of naturally occurring inconsistent videos.

Figures

Figures reproduced from arXiv: 2604.13593 by Depeng Wang, Hao Lin, Huijia Zhu, Ke Xu, Li Luo, Tanfeng Sun, Xinghao Jiang, Ya Guo, Zixuan Chen.

Figure 1. We introduce AVID, the first benchmark with explicit taxonomy and formal definitions for audio-visual inconsistency.
Figure 2. Overview of the AVID construction pipeline, consisting of three stages: (1) data preprocessing with temporal segmentation …
Figure 3. Comprehensive dataset statistical analysis: (a) Video duration distribution of both training and test sets, (b) segment …
Figure 4. The AVID-Qwen architecture for omni-modal audio-visual inconsistency detection and reasoning.
Figure 5. Confusion matrices on segment-level 8-category …
Figure 6. Model behavior analysis on full-video temporal …
Figure 7. Prompt for multi-turn audio-visual inconsistency detection in full videos.
Figure 8. Prompt for the strategy planning agent.
Figure 9. Prompt for multi-turn audio-visual inconsistency judgment.
read the original abstract

We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg. 235.5s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation across detection, temporal grounding, classification, and reasoning with 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, achieves substantial improvements over the base model (2.8× higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI systems.
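The temporal-grounding figures quoted in the abstract (mIoU 36.1% vs 26.2%) are mean intersection-over-union scores. One plausible way such a score could be computed is best-match interval IoU averaged over ground-truth events; the matching scheme below is an assumption, since the paper may aggregate differently.

```python
# Sketch of a temporal-grounding mIoU: for each ground-truth event interval,
# take the best IoU with any predicted interval, then average over events.
def interval_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(gt_events, pred_intervals):
    """Average, over ground-truth events, of the best IoU with any prediction."""
    if not gt_events:
        return 0.0
    scores = []
    for gt in gt_events:
        best = max((interval_iou(gt, p) for p in pred_intervals), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores)

# Example: one ground-truth conflict at 12.0-18.5s, model predicts 13.0-19.0s.
print(mean_iou([(12.0, 18.5)], [(13.0, 19.0)]))  # ≈ 0.786
```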

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces AVID as the first large-scale benchmark for audio-visual inconsistency understanding in long-form videos. It describes a scalable construction pipeline using temporal segmentation into Active Speaker/Voiceover/Scenic categories, an agent-driven strategy planner to select inconsistency categories, and five specialized injectors to create 11.2K videos (avg. 235.5s) with 39.4K annotated events across 8 fine-grained categories and 78.7K segment clips. The benchmark supports detection, temporal grounding, classification, and reasoning tasks. Evaluations show limitations in state-of-the-art omni-modal models, while the fine-tuned AVID-Qwen baseline reports gains including 2.8× higher BLEU-4 in segment reasoning, mIoU 36.1% (vs. 26.2%), and SODA-m 7.47 (vs. 6.15).

Significance. If the injected inconsistencies prove realistic and representative of natural cross-modal conflicts, AVID would fill a notable gap in evaluating omni-modal models on conflict perception rather than aligned captioning or deepfake detection, supporting progress toward trustworthy AI. The dataset scale, fine-grained categories, and reported improvements in the fine-tuned baseline demonstrate the benchmark's potential as a testbed. The agent-driven pipeline and multi-task annotations are strengths that could enable reproducible advances if validated.

major comments (1)
  1. The construction pipeline (temporal segmentation, agent-driven planner, and five injectors) is presented as producing semantically appropriate and realistic inconsistencies, yet the manuscript reports no human validation such as plausibility ratings, inter-rater agreement on naturalness, or comparison to naturally occurring conflicts. This is load-bearing for the central claim, as the quantitative gains (e.g., temporal grounding mIoU and reasoning BLEU-4) could reflect exploitation of synthetic artifacts rather than genuine inconsistency understanding.
minor comments (2)
  1. The results section should include a table explicitly listing all compared models with their full metric scores, so that the abstract's claim of surpassing 'all compared models' in temporal grounding and holistic understanding is verifiable.
  2. Dataset statistics such as the distribution of the 8 inconsistency categories and video length variance (beyond the reported average of 235.5s) would improve clarity on the benchmark's coverage and balance.
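The statistics requested in the second minor comment could be tabulated with a few lines of Python, assuming a hypothetical annotations.json with one record per event; the file layout and field names are invented for illustration, not the actual AVID release format.

```python
# Sketch of the requested dataset statistics, assuming a hypothetical
# annotations.json with one record per event:
#   {"video_id": ..., "duration": <seconds>, "category": ...}
import json
import statistics
from collections import Counter

def summarize(path: str = "annotations.json") -> None:
    with open(path) as f:
        records = json.load(f)

    durations = {r["video_id"]: r["duration"] for r in records}  # one entry per video
    category_counts = Counter(r["category"] for r in records)

    values = list(durations.values())
    print(f"videos: {len(values)}, events: {len(records)}")
    print(f"duration mean: {statistics.mean(values):.1f}s, "
          f"stdev: {statistics.stdev(values):.1f}s")
    for category, n in category_counts.most_common():
        print(f"  {category}: {n} ({100 * n / len(records):.1f}%)")
```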

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on human validation of the inconsistency construction pipeline below.

read point-by-point responses
  1. Referee: The construction pipeline (temporal segmentation, agent-driven planner, and five injectors) is presented as producing semantically appropriate and realistic inconsistencies, yet the manuscript reports no human validation such as plausibility ratings, inter-rater agreement on naturalness, or comparison to naturally occurring conflicts. This is load-bearing for the central claim, as the quantitative gains (e.g., temporal grounding mIoU and reasoning BLEU-4) could reflect exploitation of synthetic artifacts rather than genuine inconsistency understanding.

    Authors: We agree that human validation is important to substantiate the realism and semantic appropriateness of the injected inconsistencies. The manuscript does not currently report plausibility ratings, inter-rater agreement, or direct comparisons to natural conflicts, which is a limitation in the presented work. The pipeline is designed with an agent-driven planner that selects inconsistency categories based on semantic analysis of each segment's category (Active Speaker, Voiceover, or Scenic) and five specialized injectors that enforce contextual constraints during injection. Nevertheless, to directly address this concern, we will add a dedicated human evaluation section in the revised manuscript. This will include plausibility and naturalness ratings by multiple annotators on a representative sample of the 39.4K events, along with inter-rater agreement metrics (e.g., Fleiss' kappa). Where feasible, we will also provide qualitative comparisons to naturally occurring audio-visual conflicts drawn from existing video datasets. These additions will help confirm that the reported gains (such as the 2.8× BLEU-4 improvement in segment reasoning and mIoU gains in temporal grounding) arise from genuine inconsistency understanding rather than exploitation of synthetic artifacts. The differential performance patterns across SOTA models and tasks provide supporting indirect evidence that the benchmark poses meaningful challenges. revision: yes
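For concreteness, the inter-rater agreement statistic the rebuttal proposes (Fleiss' kappa) can be computed as sketched below. The three rating categories and five raters per event are assumptions made for the example, not details given in the paper or the rebuttal.

```python
# Minimal Fleiss' kappa for fixed-panel ratings of sampled events into k
# categories (e.g. "natural conflict", "unnatural", "no conflict" — assumed labels).
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j;
    every item is assumed to be rated by the same number of raters."""
    N = len(counts)        # number of rated items
    n = sum(counts[0])     # raters per item
    k = len(counts[0])     # number of rating categories

    # Mean per-item agreement and chance agreement from category proportions.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Example: 4 events, 5 raters, 3 categories.
ratings = [[5, 0, 0], [3, 2, 0], [0, 4, 1], [1, 1, 3]]
print(round(fleiss_kappa(ratings), 3))  # ≈ 0.331
```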

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation chain

full rationale

The paper introduces AVID as a new benchmark via an explicit agent-driven pipeline (temporal segmentation into Active Speaker/Voiceover/Scenic, planner, and five injectors) that generates 11.2K videos and 39.4K events. All reported metrics (BLEU-4, mIoU, SODA-m) are standard external measures applied to fine-tuned AVID-Qwen versus base models and other baselines. No equations, fitted parameters renamed as predictions, self-citations to the authors' prior uniqueness results, or ansatzes smuggled in via citation appear in the abstract or described methods. The construction pipeline is presented as independent of the evaluated models, and the evaluation chain is grounded in standard external metrics rather than in quantities the benchmark itself defines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about video categorization and the validity of synthetic inconsistency injection; these are introduced without external benchmarks or human validation data in the abstract.

axioms (2)
  • domain assumption · Video content can be reliably classified into Active Speaker, Voiceover, and Scenic categories for segmentation
    Invoked as the first step of the scalable construction pipeline.
  • ad hoc to paper · An agent-driven strategy planner can select semantically appropriate inconsistency categories for injection
    Core mechanism for generating the 39.4K events.
invented entities (1)
  • Five specialized injectors · no independent evidence
    purpose: To perform diverse audio-visual conflict injection into video segments
    New components created for the benchmark construction; no independent evidence of their fidelity is provided in the abstract.

pith-pipeline@v0.9.0 · 5606 in / 1490 out tokens · 38548 ms · 2026-05-10T12:19:25.166434+00:00 · methodology

discussion (0)



    Hao Zhu, Man-Di Luo, Rui Wang, Ai-Hua Zheng, and Ran He. 2021. Deep audio- visual learning: A survey.International Journal of Automation and Computing18, 3 (2021), 351–376. A Appendix Overview This appendix provides implementation-level details that are inten- tionally omitted from the main paper for space reasons. We organize the content into four main s...