FacePlex: Full-Duplex Joint Speech-Facial Motion Generation for Conversational Avatars

Gyeong-Moon Park; Habin Lim; Hah Min Lew; Jae-Ho Lee; Ji-Su Kang

arxiv: 2606.30145 · v1 · pith:PH2OP2Z4new · submitted 2026-06-29 · 💻 cs.AI · cs.CV · cs.LG

Habin Lim, Jae-Ho Lee, Hah Min Lew, Ji-Su Kang, Gyeong-Moon Park

Pith reviewed 2026-06-30 06:36 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords full-duplex generation · joint speech-motion synthesis · streaming avatars · flow matching · cross-attention · lip synchronization · conversational AI · facial animation

The pith

FacePlex generates speech tokens and facial motion tokens jointly at every streaming step for online conversational avatars.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes full-duplex joint speech-facial motion generation as the task of producing speech and motion tokens together in real time rather than sequentially or from pre-existing audio. It introduces FacePlex as a streaming framework that uses two rolling mechanisms to keep both outputs synchronized while new input arrives. The approach matters because natural face-to-face conversation requires both modalities to emerge together without waiting for a full utterance. Experiments and a user study indicate that the joint method yields stronger lip synchronization and motion quality than systems that drive facial animation from already-generated audio.

Core claim

FacePlex is a unified streaming framework for full-duplex joint speech-facial motion generation. It adapts flow matching through Rolling Flow Matching, which commits new motion frames at each streaming step, and couples audio and motion queues through Rolling Cross-Attention so that speech and facial motion condition each other progressively. This produces speech and motion tokens together every step under online constraints and improves lip-sync quality and motion fidelity over audio-driven baselines.
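
To make the commit-and-append cycle concrete, here is a minimal Python sketch of a rolling queue. It is an illustration under stated assumptions, not the authors' implementation. It assumes L = 4 slots (the value mentioned in the Figure 2 caption), a uniform flow-time increment of 1/L, one Euler update per streaming step, and scalar stand-ins for motion frames. The function `toy_velocity` is a hypothetical replacement for the learned velocity network: it is handed the clean target directly, which a trained model would never see.

```python
import random
from collections import deque

L = 4          # queue slots; the Figure 2 caption mentions L=4
DT = 1.0 / L   # assumed flow-time increment per streaming step


def toy_velocity(x, t, target):
    # Hypothetical stand-in for the learned velocity network. On a straight
    # path x_t = (1 - t) * x0 + t * x1 the velocity is x1 - x0, which equals
    # (x1 - x_t) / (1 - t) for t < 1.
    return (target - x) / (1.0 - t)


def init_queue(targets, rng):
    # Slot k starts at flow time (L - 1 - k) / L: the front slot is nearly
    # clean and the back slot is pure noise (t = 0).
    queue = deque()
    for k, target in enumerate(targets):
        t = (L - 1 - k) * DT
        noise = rng.gauss(0.0, 1.0)
        queue.append({"x": (1.0 - t) * noise + t * target, "t": t, "target": target})
    return queue


def rolling_step(queue, new_target, rng):
    # 1) One Euler step of size DT on every slot in the queue.
    for slot in queue:
        slot["x"] += DT * toy_velocity(slot["x"], slot["t"], slot["target"])
        slot["t"] += DT
    # 2) The front slot has reached t = 1, so commit it as a finished frame.
    committed = queue.popleft()
    # 3) Append a fresh pure-noise slot at t = 0 for the newest position.
    queue.append({"x": rng.gauss(0.0, 1.0), "t": 0.0, "target": new_target})
    return committed["x"]


rng = random.Random(0)
clean_frames = [0.1 * i for i in range(10)]  # made-up 1-D motion values
queue = init_queue(clean_frames[:L], rng)
committed = [rolling_step(queue, clean_frames[L + i], rng) for i in range(6)]
```

Each call emits exactly one frame and leaves the queue with the same staggered flow times it started with. That fixed per-step cost is the property that makes the pattern suitable for streaming.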

What carries the argument

Rolling Flow Matching and Rolling Cross-Attention, which together enable joint conditioning and incremental commitment of motion frames within a single streaming pipeline.
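
As a rough illustration of the cross-attention half, the sketch below lets each slot of a rolling motion queue attend over a sliding window of speech-side hidden states. Everything here is an assumption made for illustration: single-head scaled dot-product attention with no learned projections, made-up 2-D feature vectors, a window length of L = 4, and only one direction of conditioning (motion attending to speech). The paper's actual layer design is not given in the material above.

```python
import math
from collections import deque

L = 4  # window length; the Figure 2 caption mentions L=4


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    # Single-head scaled dot-product attention with no learned projections.
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


hidden_queue = deque(maxlen=L)  # speech-side states, oldest first
motion_queue = deque(maxlen=L)  # motion slots being denoised

for step in range(6):
    hidden_queue.append([float(step), 1.0])  # made-up hidden state
    motion_queue.append([1.0, 0.0])          # made-up motion slot feature
    # Every motion slot reads from the current speech window only.
    context, weights = cross_attention(list(motion_queue),
                                       list(hidden_queue),
                                       list(hidden_queue))
```

Because both deques have a fixed `maxlen`, the attention window slides forward by one position per step and the cost per step stays constant.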

If this is right

Full-duplex conversational avatars can run under continuous online constraints without separate audio and animation stages.
Lip-sync quality and motion fidelity improve relative to models that animate a face from already-available audio.
Speech and motion can mutually condition each other as generation proceeds rather than after one modality is complete.
Ablation results isolate the contribution of the rolling mechanisms to the observed quality gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Avatar systems could move from post-hoc animation pipelines to single-stream joint generation, reducing cumulative delay in interactive settings.
The same rolling commitment pattern might apply to additional modalities such as gesture or eye movement within the same framework.
Real-time communication tools could adopt the joint output directly instead of routing audio through a separate facial animation module.

Load-bearing premise

Jointly producing speech tokens and facial motion tokens at every streaming step via the rolling mechanisms is both feasible and superior to separate speech-only or audio-driven systems.

What would settle it

A live streaming test in which an audio-first pipeline followed by separate motion generation matches or exceeds FacePlex on measured lip-sync error and motion naturalness while meeting the same latency bound.
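
The criterion above can be written down as a small predicate. This is only a sketch: the `RunResult` fields, the function name, and the choice to require both systems to meet the bound are hypothetical, and the paper's actual metrics and thresholds are not specified in the material above.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    lip_sync_error: float      # lower is better
    motion_naturalness: float  # higher is better, e.g. a mean rating
    latency_ms: float          # measured end-to-end latency


def baseline_settles_it(baseline, joint, latency_bound_ms):
    # Both systems must meet the same latency bound for the test to count.
    both_in_bound = (baseline.latency_ms <= latency_bound_ms
                     and joint.latency_ms <= latency_bound_ms)
    # "Matches or exceeds" on both quality axes.
    matches_sync = baseline.lip_sync_error <= joint.lip_sync_error
    matches_motion = baseline.motion_naturalness >= joint.motion_naturalness
    return both_in_bound and matches_sync and matches_motion
```

A `True` result from measured data would count against the load-bearing premise. A `False` result would leave it standing for that particular baseline and latency bound only.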

Figures

Figures reproduced from arXiv: 2606.30145 by Gyeong-Moon Park, Habin Lim, Hah Min Lew, Jae-Ho Lee, Ji-Su Kang.

**Figure 2.** Overview of FacePlex. At each training step, LLM, audio branch, and motion branch are jointly trained, taking the user audio and previous model streams as input to produce a hidden state and the next audio chunk. The audio chunk is temporarily enqueued so the motion branch can update a rolling motion queue with a short predicted-speech look-ahead. With L=4 queue slots, the front audio chunk and its aligned…

**Figure 3.** Lifecycle of noise $X^0_T$ to clean $X^1_T$. (a) Rolling Flow Matching maintains a motion queue with staggered flow-time states, committing the front slot and appending a new noisy slot at each step. (b) Rolling Cross-Attention aligns the rolling motion queue with the hidden-state queue $H_T$, providing a sliding speech-context window for denoising. … is $A_T = [a_{T-L+1}, \ldots, a_T]$, the hidden-state queue is $H_T$ = […

**Figure 4.** Qualitative comparisons. For each word, the two frames show the 80 ms audio chunk…

**Figure 5.** Real and synthetic motion distribution in the training set. We sample 6K frames and…

**Figure 6.** More qualitative results of FacePlex. We visualize additional generated facial-motion sequences with the corresponding phonetic and prosodic cues shown above each sequence. Best viewed with zoom.

**Figure 8.** Google Form rating questions used in the perceptual user study. Participants rated each…

**Figure 9.** Example video slide deck used in the perceptual user study. The first two panels show the…
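
The queue timing hinted at in the Figure 2 caption can be illustrated with a short bookkeeping sketch. It rests on assumptions that the truncated captions do not confirm: that one audio chunk is produced per streaming step, that a chunk lasts 80 ms (the duration mentioned in the Figure 4 caption), and that the front chunk is released together with its aligned motion frame once the queue holds L = 4 chunks. The function name `simulate` is hypothetical.

```python
from collections import deque

L = 4          # queue slots, from the Figure 2 caption
CHUNK_MS = 80  # chunk duration, assumed from the Figure 4 caption


def simulate(num_steps):
    audio_queue = deque(maxlen=L)  # a full deque drops its oldest item on append
    released = []
    for step in range(num_steps):
        audio_queue.append(step)   # index of the chunk predicted at this step
        if len(audio_queue) == L:
            front = audio_queue[0]
            # Assumption: the front chunk leaves with its aligned motion frame.
            released.append((step, front))
    return released


released = simulate(6)              # (release step, chunk index) pairs
lookahead_ms = (L - 1) * CHUNK_MS   # speech context ahead of the front chunk
```

Under these assumptions each motion frame is committed with L - 1 chunks of predicted speech ahead of it. The resulting look-ahead is plain arithmetic on the assumed values, not a latency figure reported by the paper.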

Original abstract

Natural face-to-face conversation requires real-time speech generation together with synchronized facial motion. Existing systems only partially address this problem: speech-only full-duplex models can generate speech in real time but do not produce facial motion, while audio-driven facial motion models animate a face from already available audio rather than jointly generating speech and motion online. To bridge this gap, we first formalize full-duplex joint speech-facial motion generation, where speech tokens and facial motion tokens are produced together every step. Building on this formulation, we propose FacePlex, a unified streaming framework with two key components. First, Rolling Flow Matching adapts flow matching to online motion generation by committing new motion frames at each streaming step. Second, Rolling Cross-Attention couples the streaming audio queue with the motion queue, allowing speech and facial motion to condition each other as generation progresses. Through extensive experiments, ablation studies, and a user study, we show that FacePlex enables full-duplex joint speech-facial motion generation under online streaming constraints, while achieving stronger lip-sync quality and motion fidelity than audio-driven facial motion baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FacePlex formalizes joint full-duplex streaming of speech and facial motion tokens with rolling flow matching and cross-attention, but the abstract supplies no numbers or baselines, so the performance claims cannot be checked.

The core new piece is the formalization of full-duplex joint token generation, where speech and motion tokens are produced together at each streaming step instead of one after the other. The two rolling components directly target the online constraint that prior speech-only or audio-driven systems left open: Rolling Flow Matching commits motion frames incrementally, and Rolling Cross-Attention lets the audio and motion queues condition each other.

That formulation is a clean way to state the gap. It builds on standard flow matching and attention without inventing new primitives, which keeps the approach grounded.

The abstract claims stronger lip-sync and motion fidelity from experiments, ablations, and a user study, yet it contains no tables, no metric values, no listed baselines, and no latency figures. Without those details the central claim that the joint rolling setup is actually superior stays untestable. The assumption that joint generation at every step is both feasible and better than separate pipelines is stated but not evidenced here.

The work sits squarely in conversational avatar and real-time multimodal generation. Readers already working on streaming speech or facial animation models could pick up the formulation and the rolling adaptations even if they later run their own checks. The citation pattern looks standard for the area.

If the full paper contains proper quantitative comparisons and reproducible details, it is worth sending to referees. Based on the abstract alone the evidence is too thin to judge the result, but the problem framing is clear enough that a serious review makes sense.

Referee Report

1 major / 0 minor

Summary. The paper claims to formalize full-duplex joint speech-facial motion generation and proposes FacePlex, a streaming framework using Rolling Flow Matching and Rolling Cross-Attention to jointly generate speech and facial motion tokens in real time. It asserts through experiments, ablations, and a user study that it enables this under online constraints and outperforms audio-driven baselines in lip-sync and motion fidelity.

Significance. If substantiated, this would represent a meaningful advance in conversational AI avatars by addressing the joint real-time generation of speech and synchronized facial motion, which existing systems handle separately. The rolling adaptations of flow matching and cross-attention could influence streaming generative models more broadly.

major comments (1)

[Abstract] The abstract asserts superior performance from experiments, ablations, and a user study, but provides no quantitative results, baselines, or methodological details; without these, the support for the central claim cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback on the abstract. We address the single major comment below.

Referee: [Abstract] The abstract asserts superior performance from experiments, ablations, and a user study, but provides no quantitative results, baselines, or methodological details; without these, the support for the central claim cannot be assessed.

Authors: We acknowledge that the abstract does not include specific quantitative results, baseline names, or methodological details. This is standard for abstracts due to length limits, with full details (including lip-sync metrics, motion fidelity comparisons to audio-driven baselines, ablation studies, and user study outcomes) provided in the Experiments and Results sections of the manuscript. To strengthen the abstract's support for the claims, we will revise it to include one or two key quantitative highlights from the experiments.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description formalize full-duplex joint generation as a new problem statement and introduce Rolling Flow Matching plus Rolling Cross-Attention as adaptations of established flow matching and attention mechanisms. No equations, fitted parameters, or self-citations are shown that would reduce any claimed prediction or result to an input quantity defined by the authors' own prior work. The derivation chain remains self-contained against external benchmarks such as standard flow matching and attention techniques, with validation via experiments rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities; the work is described as adapting existing flow matching and attention methods.

[1] [1]

arXiv preprint arXiv:2506.22554 (2025) 4, 5, 22, 29

Vasu Agrawal, Akinniyi Akinyemi, Kathryn Alvero, Morteza Behrooz, Julia Buffalini, Fabio Maria Carlucci, Joy Chen, Junming Chen, Zhang Chen, Shiyang Cheng, et al. Seamless interaction: Dyadic audiovisual motion modeling and large-scale dataset.arXiv preprint arXiv:2506.22554, 2025

work page arXiv 2025

[2] [2]

Wiley- Interscience, 2000

John C Bellamy.Digital Telephony (Wiley Series in Telecommunications and Signal Processing). Wiley- Interscience, 2000

2000

[3] [3]

Percep- tually accurate 3d talking head generation: New definitions, speech-mesh representation, and evaluation metrics

Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, and Tae-Hyun Oh. Percep- tually accurate 3d talking head generation: New definitions, speech-mesh representation, and evaluation metrics. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21065–21074, 2025

2025

[4] [4]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2024

2024

[5] [5]

Artalk: Speech-driven 3d head animation via autoregressive model

Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, and Tatsuya Harada. Artalk: Speech-driven 3d head animation via autoregressive model. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–9, 2025

2025

[6] [6]

Unils: End-to-end audio-driven avatars for unified listening and speaking.arXiv preprint arXiv:2512.09327, 2025

Xuangeng Chu, Ruicong Liu, Yifei Huang, Yun Liu, Yichen Peng, and Bo Zheng. Unils: End-to-end audio-driven avatars for unified listening and speaking.arXiv preprint arXiv:2512.09327, 2025

work page arXiv 2025

[7] [7]

Using uh and um in spontaneous speaking.Cognition, 84(1):73–111, 2002

Herbert H Clark and Jean E Fox Tree. Using uh and um in spontaneous speaking.Cognition, 84(1):73–111, 2002

2002

[8] [8]

Capture, learning, and synthesis of 3d speaking styles

Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. Capture, learning, and synthesis of 3d speaking styles. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10101–10111, 2019

2019

[9] Radek Daněček, Kiran Chhatre, Shashank Tripathi, Yandong Wen, Michael Black, and Timo Bolkart. Emotional speech-driven animation with content-emotion disentanglement. In SIGGRAPH Asia 2023 Conference Papers, pages 1–13, 2023.

[10] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.

[11] Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. Stable audio open. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.

[12] Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, and Taku Komura. Faceformer: Speech-driven 3d facial animation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18770–18780, 2022.

[13] Mattias Heldner and Jens Edlund. Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38(4):555–568, 2010.

[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[15] Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics (ToG), 36(4):1–12, 2017.

[16] Birgit Knudsen, Ava Creemers, and Antje S Meyer. Forgotten little words: How backchannels and particles may facilitate speech planning in conversation? Frontiers in Psychology, 11:593671, 2020.

[17] Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Masayoshi Tomizuka, et al. Streamdiffusion: A pipeline-level solution for real-time interactive generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12371–12380, 2025.

[18] Stephen C Levinson. Turn-taking in human communication–origins and implications for language processing. Trends in Cognitive Sciences, 20(1):6–14, 2016.

[19] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 36(6):194:1–194:17, 2017. URL https://doi.org/10.1145/3130800.3130813

[20] Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H Liu, and Hung-yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721, 2025.

[21] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

[22] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[23] Cheng Luo, Siyang Song, Weicheng Xie, Micol Spitale, Zongyuan Ge, Linlin Shen, and Hatice Gunes. Reactface: Online multiple appropriate facial reaction generation in dyadic interactions. IEEE Transactions on Visualization and Computer Graphics, 31(9):6190–6207, 2024.

[24] Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, and Bernard Ghanem. Omniresponse: Online multimodal conversational response generation in dyadic interactions. arXiv preprint arXiv:2505.21724, 2025.

[25] Dominic W Massaro. Perceiving talking faces: From speech perception to a behavioral principle. MIT Press, 1998.

[26] Harry McGurk and John MacDonald. Hearing lips and seeing voices. Nature, 264(5588):746–748, 1976.

[27] Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255, 2023.

[28] Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. Learning to listen: Modeling non-deterministic dyadic facial motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20395–20405, 2022.

[29] Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, and Shiry Ginosar. Can language models learn to listen? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10083–10093, 2023.

[30] Ziqiao Peng, Haoyu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, and Zhaoxin Fan. Emotalk: Speech-driven emotional disentanglement for 3d face animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20687–20697, 2023.

[31] Ziqiao Peng, Yanbo Fan, Haoyu Wu, Xuan Wang, Hongyan Liu, Jun He, and Zhaoxin Fan. Dualtalk: Dual-speaker interaction for 3d talking head conversations. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21055–21064, 2025.

[32] KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.

[33] Alexander Richard, Michael Zollhöfer, Yandong Wen, Fernando De la Torre, and Yaser Sheikh. Meshtalk: 3d face animation from speech using cross-modality disentanglement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1173–1182, 2021.

[34] Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: Voice and role control for full duplex conversational speech models. arXiv preprint arXiv:2602.06053, 2026.

[35] Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925, 2023.

[36] David Ruhe, Jonathan Heek, Tim Salimans, and Emiel Hoogeboom. Rolling diffusion models. arXiv preprint arXiv:2402.09470, 2024.

[37] Harvey Sacks, Emanuel A Schegloff, and Gail Jefferson. A simplest systematics for the organization of turn-taking for conversation. Language, 50(4):696–735, 1974.

[38] Tanya Stivers, Nicholas J Enfield, Penelope Brown, Christina Englert, Makoto Hayashi, Trine Heinemann, Gertie Hoymann, Federico Rossano, Jan Peter De Ruiter, Kyung-Eun Yoon, et al. Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26):10587–10592, 2009.

[39] William H Sumby and Irwin Pollack. Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2):212–215, 1954.

[40] Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, and Yong-jin Liu. Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. ACM Transactions on Graphics (ToG), 43(4):1–9, 2024.

[41] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research.

[42] Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774, 2024.

[43] Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen LLM. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=s1EImzs5Id

[44] Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. Codetalker: Speech-driven 3d facial animation with discrete motion prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12780–12790, 2023.

[45] Victor H Yngve. On getting a word in edgewise. In Papers from the sixth regional meeting Chicago Linguistic Society, April 16-18, 1970, Chicago Linguistic Society, Chicago, pages 567–578, 1970.

[46] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, 2023.

[47] Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao Du, et al. Omniflatten: An end-to-end gpt model for seamless voice conversation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14570–14580, 2025.

[48] Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8652–8661, 2023.

[49] Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, and Tao Mei. Responsive listening head generation: a benchmark dataset and baseline. In European Conference on Computer Vision, pages 124–142. Springer, 2022.

[50] Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. MakeItTalk: speaker-aware talking-head animation. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.

In this Appendix, we provide supplementary details and supporting analyses for FacePlex as follows:

A. Data Construction ....