pith. machine review for the scientific record.

arxiv: 2604.08405 · v2 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial attacks · talking head generation · diffusion models · multimodal attacks · lip synchronization · audio-driven animation · proactive protection

The pith

SyncBreaker jointly perturbs portrait images and driving audio with stage-specific guidance to degrade lip synchronization and facial dynamics in audio-driven talking-head generators more than single-modality attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing single-modality protections fail to suppress speech-driven facial movements in diffusion-based talking-head models because they do not target both input streams at the right generation stages. SyncBreaker addresses this by optimizing image perturbations to steer the output toward a static reference portrait and audio perturbations to suppress the cross-attention responses that carry motion information. The two streams are optimized separately and combined at inference time under perceptual constraints. White-box experiments show stronger degradation on lip-sync and facial-dynamics metrics than single-modality baselines, while input quality remains high and the attack survives common purification steps.

Core claim

SyncBreaker is a stage-aware multimodal framework that applies nullifying supervision with Multi-Interval Sampling on the image stream to aggregate guidance across denoising intervals toward a static portrait, and Cross-Attention Fooling on the audio stream to suppress interval-specific audio-conditioned responses; the independently optimized perturbations are combined at inference to break lip synchronization and facial dynamics while preserving input perceptual quality.
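The page describes MIS only at this level, but the mechanics are concrete enough to sketch. Below is a minimal PGD-style reading in PyTorch: `denoise_step(image, audio, t)` is a hypothetical hook standing in for the target generator's predicted frame at timestep t, and the interval schedule, loss, and perturbation budget are our assumptions, not the authors' implementation.

```python
import torch

def mis_nullify(x, audio, denoise_step, steps=50, eps=8 / 255, alpha=1 / 255,
                n_intervals=4, t_max=1000):
    """PGD on the portrait: aggregate a nullifying loss over timesteps drawn
    from several denoising intervals (the MIS idea, as we read it)."""
    static_target = x.detach()  # steer generation toward the still reference portrait
    delta = torch.zeros_like(x, requires_grad=True)
    bounds = torch.linspace(0, t_max, n_intervals + 1).long()
    for _ in range(steps):
        loss = x.new_zeros(())
        # Multi-Interval Sampling: one timestep per stage interval, so early
        # (global structure) and late (detail) stages all contribute guidance.
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            t = int(torch.randint(int(lo), int(hi), (1,)))
            pred = denoise_step(x + delta, audio, t)  # predicted frame at stage t
            loss = loss + torch.nn.functional.mse_loss(pred, static_target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend: push output toward stillness
            delta.clamp_(-eps, eps)             # L_inf perceptual budget (assumption)
        delta.grad = None
    return (x + delta).detach()
```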

What carries the argument

Stage-aware multimodal perturbations using Multi-Interval Sampling (MIS) for images and Cross-Attention Fooling (CAF) for audio, optimized independently and merged at inference.
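CAF is described only as suppressing interval-specific audio-conditioned cross-attention responses, so the audio stream admits a similar hedged sketch. Here `attn_maps(image, audio, t)` is a hypothetical hook returning the generator's cross-attention tensors at timestep t; the real interface, interval weighting, and perceptual constraint are not specified on this page.

```python
import torch

def caf_attack(image, audio, attn_maps, steps=100, eps=2e-3, alpha=2e-4,
               timesteps=(100, 400, 700)):
    """PGD on the waveform: drive the audio-conditioned cross-attention
    responses toward zero at a few stage-specific timesteps."""
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(steps):
        loss = audio.new_zeros(())
        for t in timesteps:  # interval-specific responses, per the CAF description
            for attn in attn_maps(image, audio + delta, t):
                loss = loss + attn.pow(2).mean()  # attention "energy" to suppress
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend: flatten the responses
            delta.clamp_(-eps, eps)             # waveform-level budget (assumption)
        delta.grad = None
    return (audio + delta).detach()
```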

If this is right

  • The attack degrades both lip synchronization and overall facial dynamics more effectively than image-only or audio-only baselines.
  • Input perceptual quality remains comparable to clean inputs under standard metrics.
  • The protection remains effective after common purification or defense steps.
  • Independent optimization of the two streams allows flexible deployment without retraining the full pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stage-specific idea could be tested on other diffusion-based animation or video generation tasks that rely on cross-attention between modalities.
  • If the attack generalizes, it suggests that future protection methods should consider modality-specific timing inside the denoising process rather than treating inputs as single blocks.
  • An interesting extension would be to measure how much the attack leaks information about the target model architecture through the required white-box access.

Load-bearing premise

Independently optimized image and audio perturbations can be added together at inference time without significant loss of effectiveness against the target model.
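In code the premise is almost trivially small, which is exactly what makes it load-bearing; a sketch, with clamp ranges assumed rather than taken from the paper:

```python
import torch

def protect(x, audio, delta_img, delta_aud):
    """Inference-time composition under the additivity premise: no joint
    retuning, just elementwise addition inside each modality's budget."""
    x_prot = (x + delta_img).clamp(0.0, 1.0)       # protected portrait
    a_prot = (audio + delta_aud).clamp(-1.0, 1.0)  # protected waveform
    return x_prot, a_prot
```

If destructive interference between the two perturbations were significant, this addition is exactly where it would show up.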

What would settle it

A white-box experiment on the same diffusion talking-head model where the combined attack produces no greater drop in lip-sync metrics than the stronger of the two single-modality attacks alone.
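A minimal harness for that experiment might look as follows, pinning seeds so every condition shares the same diffusion trajectories; `generate` and `lipsync_error` are stand-ins for the target generator and a sync metric such as LSE, not real APIs.

```python
import torch

def sync_degradation(generate, lipsync_error, x, a, x_adv, a_adv, seeds=range(5)):
    """Average lip-sync degradation per attack condition under pinned seeds."""
    def err(img, aud):
        vals = []
        for s in seeds:
            torch.manual_seed(s)  # pin the sampling trajectory for this run
            vals.append(lipsync_error(generate(img, aud), a))
        return sum(vals) / len(vals)

    clean = err(x, a)
    deg = {
        "image-only": err(x_adv, a) - clean,
        "audio-only": err(x, a_adv) - clean,
        "joint":      err(x_adv, a_adv) - clean,
    }
    # The superiority claim fails if the joint degradation is no larger
    # than the stronger single-modality attack's.
    synergy = deg["joint"] > max(deg["image-only"], deg["audio-only"])
    return deg, synergy
```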

Figures

Figures reproduced from arXiv: 2604.08405 by Guo Cheng, Sirui Zhao, Tong Xu, Wenli Zhang, Xianglong Shi, Xinqi Chen, Yifan Xu, Yong Liao.

Figure 1: (A) Audio-driven talking-head generation can be …
Figure 2: Overview of SyncBreaker. The image stream employs MIS-based Nullifying Loss to redirect the generation objective …
Figure 3: Visualization of denoising results across different diffusion stages, including the global structure stage, contour …
Figure 4: Audio-conditioned cross-attention maps. (a) At a …
Figure 5: Qualitative comparison of videos generated from inputs protected by all compared attack methods.
Original abstract

Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SyncBreaker, a stage-aware multimodal adversarial framework for proactively protecting against misuse of diffusion-based audio-driven talking-head generators. It proposes Multi-Interval Sampling (MIS) to apply nullifying supervision on the image stream across multiple denoising stages and Cross-Attention Fooling (CAF) to suppress audio-conditioned cross-attention responses in the audio stream. The two perturbations are optimized independently under perceptual constraints and simply added at inference time. The central empirical claim is that this joint attack degrades lip synchronization and facial dynamics more effectively than strong single-modality baselines while preserving input quality and remaining robust to purification defenses.

Significance. If the superiority claim holds after proper verification, the work would be significant for the field of adversarial robustness in multimodal generative models. It provides a concrete, deployable protection method that exploits the staged nature of diffusion and cross-attention fusion, which single-modality attacks miss. The public code release and focus on white-box proactive settings are positive contributions that could inform both attack and defense research in talking-head synthesis.

major comments (2)
  1. [Method (MIS and CAF optimization) and Experiments (comparison tables)] The central claim of multimodal superiority rests on the assumption that independently optimized MIS and CAF perturbations compose additively without destructive interference. Because the target model fuses audio features into image denoising exclusively via cross-attention at every stage, an image perturbation that nullifies one set of attention maps could be partially counteracted by an audio perturbation targeting a different interval. No ablation is described that directly compares the joint attack against the stronger of the two single-modality attacks on identical random seeds and diffusion trajectories; without this check, reported gains could be an artifact of non-additive interaction rather than true synergy.
  2. [Abstract and §4 (Experiments)] The abstract and method description assert that SyncBreaker outperforms single-modality baselines on lip synchronization and facial dynamics metrics, yet the provided text supplies no quantitative tables, dataset details, statistical significance tests, or ablation breakdowns. This makes it impossible to verify the magnitude of improvement or whether the stage-aware components are load-bearing.
minor comments (2)
  1. [Abstract] The abstract claims 'extensive experiments' but does not name the specific talking-head models, datasets, or evaluation metrics (e.g., LSE, FID, or lip-sync error) used; these should be stated explicitly even in the abstract for clarity.
  2. [Method] Notation for the diffusion stages and cross-attention intervals in the MIS and CAF formulations could be made more precise; currently the interval aggregation is described at a high level without explicit equations for the combined guidance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the valuable comments and suggestions. Below we provide point-by-point responses to the major comments and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Method (MIS and CAF optimization) and Experiments (comparison tables)] The central claim of multimodal superiority rests on the assumption that independently optimized MIS and CAF perturbations compose additively without destructive interference. Because the target model fuses audio features into image denoising exclusively via cross-attention at every stage, an image perturbation that nullifies one set of attention maps could be partially counteracted by an audio perturbation targeting a different interval. No ablation is described that directly compares the joint attack against the stronger of the two single-modality attacks on identical random seeds and diffusion trajectories; without this check, reported gains could be an artifact of non-additive interaction rather than true synergy.

    Authors: We thank the referee for pointing out the potential issue of destructive interference in the composition of MIS and CAF perturbations. Our design optimizes each modality's perturbation independently to respect perceptual constraints and targets distinct aspects: MIS applies nullifying supervision on image denoising stages, while CAF fools the cross-attention in the audio-conditioned path. However, we agree that without a direct comparison under controlled conditions, the synergy claim requires stronger evidence. In the revised version, we will add an ablation that evaluates the joint attack versus the best single-modality attack on identical seeds and trajectories to confirm the gains are not due to non-additive effects. revision: yes

  2. Referee: [Abstract and §4 (Experiments)] The abstract and method description assert that SyncBreaker outperforms single-modality baselines on lip synchronization and facial dynamics metrics, yet the provided text supplies no quantitative tables, dataset details, statistical significance tests, or ablation breakdowns. This makes it impossible to verify the magnitude of improvement or whether the stage-aware components are load-bearing.

    Authors: We acknowledge the referee's concern that the experimental results are not presented with sufficient detail in the current manuscript. The abstract summarizes the findings, but we agree that quantitative tables, dataset information, significance tests, and ablations are essential for verification. We will substantially expand §4 in the revision to include these elements, ensuring the magnitude of improvements and the role of stage-aware components are clearly demonstrated and statistically supported. revision: yes

Circularity Check

0 steps flagged

Empirical attack framework with no load-bearing derivations or self-referential reductions

Full rationale

The paper describes an empirical multimodal adversarial attack (SyncBreaker) using independently optimized MIS image perturbations and CAF audio perturbations that are simply added at inference. No equations, first-principles derivations, or predictions are presented that reduce claimed performance to fitted parameters, self-definitions, or self-citation chains. All claims rest on experimental comparisons against single-modality baselines in a white-box setting, with no mathematical structure that could be circular by construction. The method is self-contained as a practical attack design evaluated on external models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed MIS and CAF modules under perceptual constraints; no explicit free parameters, mathematical axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5523 in / 1138 out tokens · 42616 ms · 2026-05-10T17:26:12.673235+00:00 · methodology

