Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3
The pith
A multi-head Gaussian kernel bridges the temporal scale gap between talking and listening to enable full-duplex interactive avatars.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Recognizing the unique temporal scale discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics.
What carries the argument
A multi-head Gaussian kernel that supplies a progressive temporal inductive bias, capturing the scale discrepancy between short-term talking alignment and longer-range listening context.
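To make the mechanism concrete, the following is a minimal sketch of one way such a bias could be realized: a per-head Gaussian penalty on the video-audio time offset, added to cross-attention logits before the softmax. This is not the authors' implementation; the head scales (sigmas), frame counts, and fps_ratio are illustrative assumptions only.

    import torch

    def gaussian_temporal_bias(num_video, num_audio, sigmas, fps_ratio=1.0):
        # Per-head additive attention bias: head h penalizes audio frame j for video
        # frame i by -(t_i - t_j)^2 / (2 * sigma_h^2). Small sigmas favour tight
        # frame-level alignment (talking); large sigmas admit long-range
        # conversational context (listening).
        t_video = torch.arange(num_video, dtype=torch.float32)               # video frame times
        t_audio = torch.arange(num_audio, dtype=torch.float32) * fps_ratio   # audio frame times
        delta = t_video[:, None] - t_audio[None, :]                          # (V, A) time offsets
        sigmas = torch.as_tensor(sigmas, dtype=torch.float32)                # (H,) head scales
        return -(delta[None] ** 2) / (2.0 * sigmas[:, None, None] ** 2)      # (H, V, A)

    # Usage: add the bias to per-head cross-attention logits before the softmax.
    V, A, H, D = 16, 64, 4, 32                     # hypothetical sizes
    q = torch.randn(H, V, D)                       # video-frame queries
    k = torch.randn(H, A, D)                       # audio-frame keys
    logits = q @ k.transpose(-1, -2) / D ** 0.5    # (H, V, A) content logits
    bias = gaussian_temporal_bias(V, A, sigmas=[0.5, 2.0, 8.0, 32.0])  # progressive scales
    attn = torch.softmax(logits + bias, dim=-1)

The progressive spread of sigmas is what would make the bias "progressive": some heads stay locked to near-synchronous audio while others are free to integrate seconds of conversational context.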
If this is right
- The same dual-stream audio processing supports simultaneous talking and listening without separate modules.
- The kernel bias lets the model keep precise short-term alignment while still using longer conversational context.
- A cleaned dataset with decoupled speech and background tracks becomes usable for training interactive agents.
- The resulting avatars achieve higher naturalness and responsiveness than prior monologue extensions.
Where Pith is reading between the lines
- Similar scale-aware kernels could be tested on other mismatched-timing generation tasks such as gesture or facial expression synthesis during dialogue.
- The success of an explicit physical bias over generic attention suggests that domain-specific temporal priors may reduce compute in real-time conversational systems.
- Deploying the model on live user audio streams would test whether the inductive bias holds without retraining on every new speaker.
Load-bearing premise
Frame-by-frame alignment makes listening responses rigid while global attention destroys lip synchronization, and the Gaussian kernel resolves the mismatch without new artifacts or full attention.
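Under the bias form sketched above (an assumption, not the paper's stated formulation), this premise can be made slightly more precise: the Gaussian term interpolates between the two failure modes, with strict frame-to-frame alignment and unbiased global attention as its limiting cases.

    % Assumed per-head weight of audio frame j for video frame i,
    % with content logit s_ij and head scale sigma_h:
    \[
      \alpha^{(h)}_{ij} \;\propto\; \exp\!\Big( s_{ij} - \frac{(t_i - t_j)^2}{2\sigma_h^2} \Big)
    \]
    % sigma_h -> 0:   the penalty collapses all mass onto t_j = t_i, i.e. strict
    %                 frame-to-frame alignment (assuming an exactly aligned audio frame exists).
    % sigma_h -> inf: the penalty vanishes and the head reduces to plain global attention,
    %                 alpha_ij proportional to exp(s_ij).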
What would settle it
Side-by-side evaluation on extended live conversations in which the kernel model shows no measurable gain in lip-sync accuracy or naturalness ratings compared with strong frame-alignment or attention baselines.
Original abstract
Audio-driven human video generation has achieved remarkable success in monologue scenarios, largely driven by advancements in powerful video generation foundation models. Moving beyond monologues, authentic human communication is inherently a full-duplex interactive process, requiring virtual agents not only to articulate their own speech but also to react naturally to incoming conversational audio. Most existing methods simply extend conventional audio-driven paradigms to listening scenarios. However, relying on strict frame-to-frame alignment renders the model's response to long-range conversational dynamics rigid, whereas directly introducing global attention catastrophically degrades lip synchronization. Recognizing the unique temporal Scale Discrepancy between talking and listening behaviors, we introduce a multi-head Gaussian kernel to explicitly inject this physical intuition into the model as a progressive temporal inductive bias. Building upon this, we construct a full-duplex interactive virtual agent capable of simultaneously processing dual-stream audio inputs for both talking and listening. Furthermore, we introduce a rigorously cleaned Talking-Listening dataset VoxHear featuring perfectly decoupled speech and background audio tracks. Extensive experiments demonstrate that our approach successfully fuses strong temporal alignment with deep contextual semantics, setting a new state-of-the-art for generating highly natural and responsive full-duplex interactive digital humans. The project page is available at https://warmcongee.github.io/beyond-monologue/ .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to advance audio-driven avatar generation beyond monologues to full-duplex interactive scenarios. It introduces a multi-head Gaussian kernel to inject physical intuition about the temporal scale discrepancy between talking and listening as a progressive inductive bias, enabling simultaneous dual-stream audio processing. The work also presents the cleaned VoxHear dataset with decoupled speech and background tracks, and asserts via extensive experiments that the approach fuses strong temporal alignment with contextual semantics to achieve SOTA natural and responsive interactive digital humans.
Significance. If validated, the work could meaningfully advance interactive virtual agent generation by addressing rigidity in frame-to-frame alignment and degradation from global attention through an explicit temporal-scale kernel. The VoxHear dataset with perfectly decoupled tracks is a concrete positive contribution that could support future full-duplex research.
major comments (3)
- Abstract: The central SOTA claim for fusing temporal alignment with deep contextual semantics and generating highly natural full-duplex avatars is asserted without any quantitative metrics, baselines, ablation results, or error analysis, leaving the claim unsupported in the provided text.
- Method (multi-head Gaussian kernel description): The claim that the kernel resolves the temporal scale discrepancy via progressive inductive bias without new artifacts rests on untested premises; no derivation of effective temporal support, parameter sensitivity analysis, or ablation isolating the kernel from dual-stream audio processing is supplied, which is load-bearing for the responsiveness and naturalness assertions.
- Experiments section (implied by abstract): Absence of details on how the kernel's scales/variances were set, specific lip-synchronization or naturalness metrics, or comparisons against strict frame-to-frame and global-attention baselines prevents verification that the kernel outperforms alternatives on the VoxHear dataset.
minor comments (1)
- Abstract: The project page link is a helpful addition for supplementary materials.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights opportunities to strengthen the presentation of our claims and technical details. We address each major comment below and will incorporate revisions to improve clarity and support for the assertions.
Point-by-point responses
- Referee: Abstract: The central SOTA claim for fusing temporal alignment with deep contextual semantics and generating highly natural full-duplex avatars is asserted without any quantitative metrics, baselines, ablation results, or error analysis, leaving the claim unsupported in the provided text.
Authors: The abstract serves as a high-level summary of the contributions and results. The full manuscript provides quantitative metrics, baseline comparisons, ablations, and error analysis in the Experiments section, demonstrating SOTA performance on the VoxHear dataset. To better support the claim within the abstract itself, we will revise it to include key quantitative highlights such as specific improvements in lip synchronization and naturalness scores. revision: yes
- Referee: Method (multi-head Gaussian kernel description): The claim that the kernel resolves the temporal scale discrepancy via progressive inductive bias without new artifacts rests on untested premises; no derivation of effective temporal support, parameter sensitivity analysis, or ablation isolating the kernel from dual-stream audio processing is supplied, which is load-bearing for the responsiveness and naturalness assertions.
Authors: The multi-head Gaussian kernel is formulated in Section 3.2 to inject the temporal scale discrepancy as an inductive bias, with different heads applying variances suited to short-term talking alignment versus longer-range listening context. While the design rationale is explained, we agree that additional supporting analysis is warranted. In the revised manuscript, we will add a derivation of the effective temporal support, a parameter sensitivity analysis on the variances, and an ablation isolating the kernel from the dual-stream processing (an illustrative sketch of such a support calculation appears after these responses). revision: yes
- Referee: Experiments section (implied by abstract): Absence of details on how the kernel's scales/variances were set, specific lip-synchronization or naturalness metrics, or comparisons against strict frame-to-frame and global-attention baselines prevents verification that the kernel outperforms alternatives on the VoxHear dataset.
Authors: The Experiments section details evaluations on the VoxHear dataset using lip-synchronization metrics (such as LSE-C and LSE-D) and naturalness via user studies, with comparisons to existing methods. The kernel variances were set via empirical tuning aligned to conversational temporal scales. To enhance verifiability, we will explicitly report the selected variance values, expand metric descriptions, and include direct quantitative comparisons to strict frame-to-frame and global-attention baselines. revision: yes
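As a rough illustration of what the promised effective-support derivation could look like (a sketch under the same assumed Gaussian-bias form, not the authors' analysis), the prior a head with scale sigma_h places on time offsets concentrates almost all of its mass within a few multiples of sigma_h:

    % Fraction of the head-h prior's mass within |dt| <= k * sigma_h:
    \[
      w_h(\Delta t) \propto \exp\!\Big(-\frac{\Delta t^2}{2\sigma_h^2}\Big),
      \qquad
      \frac{\int_{-k\sigma_h}^{\,k\sigma_h} w_h(\Delta t)\,\mathrm{d}\Delta t}
           {\int_{-\infty}^{\,\infty} w_h(\Delta t)\,\mathrm{d}\Delta t}
      = \operatorname{erf}\!\Big(\tfrac{k}{\sqrt{2}}\Big) \approx 0.954 \ \text{for } k = 2.
    \]
    % So the effective temporal support of head h is roughly +/- 2 sigma_h. At 25 fps,
    % a hypothetical sigma_h of 2 frames covers about +/- 0.16 s (lip-sync range),
    % while sigma_h = 50 frames covers about +/- 4 s (conversational range).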
Circularity Check
No circularity: kernel introduced as explicit inductive bias from stated intuition
Full rationale
The paper claims to recognize a temporal scale discrepancy between talking and listening, then introduces a multi-head Gaussian kernel to inject that intuition as progressive temporal inductive bias. This architectural choice is presented as a direct modeling decision rather than a fitted parameter, self-referential definition, or result derived from the target outputs. No equations, self-citations, or steps are shown that would make the claimed fusion of alignment and context equivalent to the inputs by construction. The subsequent claims rest on empirical results with the VoxHear dataset and baseline comparisons, keeping the derivation chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- multi-head Gaussian kernel scales/variances
axioms (1)
- domain assumption: Talking and listening behaviors exhibit a unique temporal scale discrepancy that standard attention mechanisms cannot handle without degrading lip sync.