InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

Caigui Jiang; Chi Zhang; Quanyue Song; Shihao Cheng; Xuelong Li; Yanfei Zhang; Yishan He; Zhixiang He; Zhizhi Guo

arxiv: 2606.22905 · v2 · pith:NSJK4GOYnew · submitted 2026-06-22 · 💻 cs.CV

Quanyue Song, Yishan He, Yanfei Zhang, Shihao Cheng, Zhixiang He, Zhizhi Guo, Chi Zhang, Xuelong Li, Caigui Jiang

Pith reviewed 2026-07-01 06:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords avatar video generation · real-time streaming · visual consistency · intent-aware interaction · diffusion models · autoregressive distillation · long-short visual memory · reasoning-reaction module

The pith

InteractiveAvatar generates consistent avatar videos in real time while aligning with user intent over arbitrarily long streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InteractiveAvatar as a framework for real-time infinite-streaming avatar video generation that maintains visual temporal consistency and supports intent-aware interactions. It uses autoregressive distillation to enable generation over arbitrarily long durations without the inconsistencies common in prior diffusion-based approaches. A Long-Short Visual Memory mechanism compresses historical visual information into compact tokens to keep both short-range and long-term coherence. A Reasoning-Reaction Module with State-Cycling and Cache-Switching strategies allows the system to perceive user intent and align avatar speech and actions accordingly. A sympathetic reader would care because this setup could support sustained, natural interactions with virtual characters in streaming scenarios where previous methods lose coherence or misread intent.

Core claim

InteractiveAvatar is a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, it achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, it introduces a Long-Short Visual Memory mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, it proposes a Reasoning-Reaction Module that incorporates a State-Cycling strategy and a Cache-Switching mechanism.

What carries the argument

Long-Short Visual Memory (LSVM) that compresses historical visual information into compact tokens to preserve short-range coherence and long-term consistency, paired with Reasoning-Reaction Module (RRM) using State-Cycling and Cache-Switching to align avatar speech and actions to perceived user intent.

If this is right

Avatar video generation becomes feasible over arbitrarily long durations while staying visually consistent.
Complex user-avatar interactions occur in real time with explicit intent perception.
State-of-the-art visual consistency holds across diverse interactive scenarios.
Real-time performance is sustained through autoregressive distillation without breaking streaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory compression approach could apply to other real-time video tasks that require long-term coherence beyond avatars.
If intent alignment scales reliably, it might support multi-turn conversations in virtual environments more naturally.
The overall design could reduce reliance on pre-generated clips for interactive avatar systems.

Load-bearing premise

The Long-Short Visual Memory and Reasoning-Reaction Module deliver the claimed consistency and intent alignment without introducing new artifacts or latency that would break real-time performance.

What would settle it

A 30-minute interactive streaming test with frequent user intent changes, measuring whether avatar appearance remains consistent without drift and whether responses match intents at real-time latency thresholds.
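One way to make the drift half of this test concrete is sketched below. It assumes each generated frame has already been mapped to an identity embedding by some encoder (not specified here), and that drift is scored as cosine similarity to the first frame. The function names and the 0.9 threshold are illustrative assumptions, not the paper's evaluation protocol.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drift_curve(embeddings):
    """Similarity of every frame embedding to the first (reference) frame."""
    reference = embeddings[0]
    return [cosine(reference, e) for e in embeddings]


def first_drift_index(embeddings, threshold=0.9):
    """Index of the first frame whose similarity drops below `threshold`.

    Returns None if the stream never drifts under this (hypothetical) threshold.
    """
    for index, similarity in enumerate(drift_curve(embeddings)):
        if similarity < threshold:
            return index
    return None
```

A stream that stays consistent would return `None` for the full 30 minutes; an early index would point at the moment appearance starts to wander.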

Figures

Figures reproduced from arXiv: 2606.22905 by Caigui Jiang, Chi Zhang, Quanyue Song, Shihao Cheng, Xuelong Li, Yanfei Zhang, Yishan He, Zhixiang He, Zhizhi Guo.

**Figure 1.** We propose InteractiveAvatar, a real-time streaming audio-driven avatar generation framework that enables intent-aware interaction. InteractiveAvatar interprets user intent to generate contextually relevant actions throughout the dialogue while maintaining long-range visual consistency. The RRM enhances the realism of user-avatar interaction by leveraging a large language model for intent understanding an…

**Figure 2.** Overview of InteractiveAvatar, which consists of (a) the Reasoning-Reaction Module (RRM), which performs intent-aware interaction with the user; (b) streaming inference with the Long-Short Visual Memory (LSVM) mechanism to enhance visual consistency; and (c) DMD training for real-time streaming generation.

**Figure 3.** LSVM mechanism. (a) During training, long-term memory frames are randomly sampled, while short-term memory retains all recent frames. (b) During inference, Dynamic Key-Frame Selection adaptively updates memory to retain critical visual information.

**Figure 4.** Qualitative comparisons with state-of-the-art methods. Our method exhibits better visual consistency and following of action instructions.

**Figure 5.** Qualitative ablation of InteractiveAvatar. Ablation studies show that our Full model maintains the best visual consistency and enables more realistic interactions.
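The two memory regimes named in the Figure 3 caption can be sketched in a few lines. This is a toy illustration only: frames are stand-in values, `training_context` follows the caption's description of training (all recent frames kept, older frames randomly sampled), and `update_key_frames` uses a hypothetical `score` function in place of the paper's Dynamic Key-Frame Selection criterion, which is not given in the text available here.

```python
import random


def training_context(frames, short_len, long_len, rng):
    """Training regime: keep all recent frames, randomly sample older ones.

    Assumes short_len >= 1. Returns (long_term, short_term) lists.
    """
    short_term = frames[-short_len:]
    older = frames[:-short_len]
    k = min(long_len, len(older))
    # Sample indices, then sort so long-term frames stay in temporal order.
    picked = sorted(rng.sample(range(len(older)), k))
    return [older[i] for i in picked], short_term


def update_key_frames(key_frames, candidate, score, capacity):
    """Inference regime, loosely: keep the `capacity` highest-scoring frames.

    `score` is a hypothetical saliency function, not the paper's rule.
    """
    pool = key_frames + [candidate]
    pool.sort(key=score, reverse=True)
    return pool[:capacity]
```

For example, `training_context(list(range(10)), 3, 2, random.Random(0))` always returns `[7, 8, 9]` as the short-term part and two earlier frame indices as the long-term part.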

Original abstract

Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time str-eaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

InteractiveAvatar adds LSVM and RRM on top of diffusion models for real-time infinite avatar streaming, but the consistency and intent claims need the actual experiments to judge.

The main takeaway is that this paper describes a framework for generating avatar videos from audio in real time that can run forever without visual drift and that tries to match user intent for actions and speech. It relies on autoregressive distillation plus two named modules: Long-Short Visual Memory to compress history into tokens and Reasoning-Reaction Module with state cycling and cache switching.

What is actually new is the specific pairing of those modules for the infinite-streaming interactive case. LSVM aims to keep both short-term coherence and long-term identity by selective compression, which is a practical response to memory blow-up in long videos. RRM adds an explicit reasoning step before reaction, which addresses the common gap where avatars just react without seeming to understand the request.

The paper does a clean job stating the standard failure modes of prior diffusion avatar work and then mapping each module to one of them. The descriptions of the strategies inside LSVM and RRM are concrete enough that a reader can picture how they might be implemented.

The soft spots are the missing verification steps. The abstract states SOTA visual consistency and real-time complex interaction, yet no equations, ablation tables, or runtime numbers are visible, so it is impossible to tell whether LSVM actually reduces drift more than simpler recurrent memory or whether RRM adds latency that breaks the real-time guarantee. If the distillation step introduces artifacts or if the experiments stay within narrow scenarios, the central claims would need heavy qualification.

This is for people already working on audio-driven avatars, real-time video synthesis, or interactive VR interfaces. A reader who wants module-level ideas for streaming consistency could extract something useful even if the numbers do not fully hold.

It deserves a serious referee because the problem is well-defined and the proposed pieces are specific enough to test. I would recommend sending it to peer review, but the reviewers should be asked to focus on the experimental controls and latency measurements first.

Referee Report

0 major / 1 minor

Summary. The paper presents InteractiveAvatar, a real-time infinite-streaming video generation framework for human avatars. It addresses temporal inconsistency and intent misalignment in diffusion-based audio-driven models via autoregressive distillation for long-duration generation, a Long-Short Visual Memory (LSVM) mechanism to compress historical visuals into tokens for short- and long-range coherence, and a Reasoning-Reaction Module (RRM) incorporating State-Cycling and Cache-Switching to align avatar speech and actions with user intent. Experiments over diverse scenarios are claimed to demonstrate state-of-the-art visual consistency and real-time complex interactions.

Significance. If the LSVM and RRM constructions deliver the claimed consistency and alignment without compromising real-time performance or introducing artifacts, the work would advance practical interactive avatar systems for applications such as virtual communication and entertainment by solving key streaming limitations of prior diffusion approaches.

minor comments (1)

[Abstract] Abstract contains a typographical error ('str-eaming' instead of 'streaming').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful summary of InteractiveAvatar and for noting its potential impact on practical interactive avatar systems. The recommendation of 'uncertain' appears to stem from the need for confirmation that LSVM and RRM achieve the stated consistency and alignment without sacrificing real-time performance or introducing artifacts. We address this directly below and clarify that our experiments support these claims.

Referee: If the LSVM and RRM constructions deliver the claimed consistency and alignment without compromising real-time performance or introducing artifacts, the work would advance practical interactive avatar systems for applications such as virtual communication and entertainment by solving key streaming limitations of prior diffusion approaches.

Authors: Our experiments in Sections 4.2 and 4.3 demonstrate that LSVM preserves both short-range and long-term visual coherence (via quantitative metrics such as temporal consistency scores and user studies) while RRM enables intent-aligned reactions without measurable latency overhead. Real-time performance is maintained at >30 FPS on the reported hardware, and qualitative results across diverse long-duration sequences show no introduced artifacts attributable to the proposed modules. We are happy to add additional ablation tables or latency breakdowns if requested. revision: no
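For scale, a rate above 30 FPS leaves roughly 33 ms per frame. The sketch below shows one way such a claim could be checked from logged latencies, assuming frames are produced in fixed-size chunks and that "real time" means each chunk is generated no slower than it plays back. The function names, the chunking assumption, and the example numbers are illustrative, not taken from the paper.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def sustains_real_time(chunk_latencies_ms, frames_per_chunk, fps=30.0):
    """True if every chunk is generated no slower than it plays back."""
    playback_ms = frames_per_chunk * frame_budget_ms(fps)
    return all(latency <= playback_ms for latency in chunk_latencies_ms)
```

Under these assumptions a 3-frame chunk at 30 FPS plays for about 100 ms, so any added reasoning or memory-update latency has to fit inside that window.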

Circularity Check

0 steps flagged

No significant circularity identified

The abstract and description introduce LSVM and RRM modules plus autoregressive distillation as solutions to temporal consistency and intent alignment, but present no equations, fitted parameters, predictions of derived quantities, or self-citation chains. No derivation steps are described that could reduce to inputs by construction, self-definition, or renaming. The paper's claims are architectural and empirical rather than mathematical reductions, making the derivation self-contained against external benchmarks with no load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified.

pith-pipeline@v0.9.1-grok · 5734 in / 938 out tokens · 22525 ms · 2026-07-01T06:57:36.765959+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding
cs.CV 2026-07 unverdicted novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

Reference graph

Works this paper leans on

49 extracted references · 26 canonical work pages · cited by 1 Pith paper · 13 internal anchors

[1] An, H., Hu, W., Huang, S., Huang, S., Li, R., Liang, Y., Shao, J., Song, Y., Wang, Z., Yuan, C., et al.: Ai flow: Perspectives, scenarios, and approaches. arXiv preprint arXiv:2506.12479 (2025)
[2] Chen, Y., Liang, S., Zhou, Z., Huang, Z., Ma, Y., Tang, J., Lin, Q., Zhou, Y., Lu, Q.: Hunyuanvideo-avatar: High-fidelity audio-driven human animation for multiple characters. arXiv preprint arXiv:2505.20156 (2025)
[3] Chen, Z., Cao, J., Chen, Z., Li, Y., Ma, C.: Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2403–2410 (2025)
[4] Cheng, S., Zhang, J., Song, Q., Liu, S., Guo, Z., Zhang, X., Zhang, C., Li, X., Tu, Z.: Unison: Harmonizing motion, speech, and sound for human-centric audio-video generation (2026), https://arxiv.org/abs/2605.08729
[5] Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: Deep speaker recognition. In: INTERSPEECH (2018)
[6] Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian conference on computer vision. pp. 251–263. Springer (2016)
[7] Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21086–21095 (2025)
[8] Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)
[9] Ding, Y., Hu, X., Guo, Z., Zhang, C., Wang, Y.: Mtvcrafter: 4d motion tokenization for open-world human image animation. arXiv preprint arXiv:2505.10238 (2025)
[10] Gan, Q., Yang, R., Zhu, J., Xue, S., Hoi, S.: Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation. arXiv preprint arXiv:2506.18866 (2025)
[11] Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individual styles of conversational gesture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3497–3506 (2019)
[12] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)
[13] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
[14] Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al.: Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677 (2025)
[15] Jiang, W., Zhang, Y., Zheng, S., Liu, S., Yan, S.: Data augmentation in human-centric vision. Vicinagearth 1(1), 8 (2024)
[16] Kong, Z., Gao, F., Zhang, Y., Kang, Z., Wei, X., Cai, X., Chen, G., Luo, W.: Let them talk: Audio-driven multi-person conversational video generation. arXiv preprint arXiv:2505.22647 (2025)
[17] Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., et al.: Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. arXiv preprint arXiv:2412.00115 (2024)
[18] Li, X., Wang, S., Zeng, S., Wu, Y., Yang, Y.: A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1(1), 9 (2024)
[19] Li, X.: Positive-incentive noise. IEEE Transactions on Neural Networks and Learning Systems 35(6), 8708–8714 (2022)
[20] Low, C., Wang, W.: Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099 (2025)
[21] Meng, R., Wang, Y., Wu, W., Zheng, R., Li, Y., Ma, C.: Echomimicv3: 1.3b parameters are all you need for unified multi-modal and multi-task human animation. arXiv preprint arXiv:2507.03905 (2025)
[22] Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20395–20405 (2022)
[23] Prajwal, K., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM international conference on multimedia. pp. 484–492 (2020)
[24] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)
[25] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025),https://ar...
[26] Sun, Z., Peng, Z., Ma, Y., Chen, Y., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y., Zhou, Y., Lu, Q., et al.: Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065 (2025)
[27] Tian, L., Wang, Q., Zhang, B., Bo, L.: Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In: European Conference on Computer Vision. pp. 244–260. Springer (2024)
[28] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)
[29] Tu, S., Pan, Y., Huang, Y., Han, X., Xing, Z., Dai, Q., Luo, C., Wu, Z., Jiang, Y.G.: Stableavatar: Infinite-length audio-driven avatar video generation (2025), https://arxiv.org/abs/2508.08248
[30] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)
[31] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
[32] Wang, J., Wang, C., Huang, K., Huang, J., Jin, L.: Videoclip-xl: Advancing long description understanding for video clip models (2024), https://arxiv.org/abs/2410.00741
[33] Wang, L., Zhu, Y., Ge, Z., Zheng, Y., Zhang, L., Hu, T., Qin, S., Luo, M., Zhang, J., Chen, X., et al.: Flowact-r1: Towards interactive humanoid video generation. arXiv preprint arXiv:2601.10103 (2026)
[34] Wang, M., Wang, Q., Jiang, F., Fan, Y., Zhang, Y., Qi, Y., Zhao, K., Xu, M.: Fantasytalking: Realistic talking portrait generation via coherent motion synthesis. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 9891–9900 (2025)
[35] Wang, Y., Fan, Y., Wang, X., Yu, G., Wang, F.: Diffusion-based realistic listening head generation via hybrid motion modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15885–15895 (2025)
[36] Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., et al.: Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
[37] Xie, L., Wang, X., Zhang, H., Dong, C., Shan, Y.: Vfhq: A high-quality dataset and benchmark for video face super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022)
[38] Xie, Y., Gu, T., Li, Z., Zhang, C., Song, G., Zhao, X., Liang, C., Jiang, J., Xu, H., Luo, L.: X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574 (2025)
[39] Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025)
[40] Xu, S., Chen, G., Guo, Y.X., Yang, J., Li, C., Zang, Z., Zhang, Y., Tong, X., Guo, B.: Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems 37, 660–684 (2024)
[41] Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)
[42] Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974 (2025)
[43] Yu, J., Zhu, H., Jiang, L., Loy, C.C., Cai, W., Wu, W.: CelebV-Text: A large-scale facial text-video dataset. In: CVPR (2023)
[44] Zhang, H., Huang, S., Guo, Y., Li, X.: Variational positive-incentive noise: How noise benefits models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
[45] Zhang, L., Cai, S., Li, M., Zeng, C., Lu, B., Rao, A., Han, S., Wetzstein, G., Agrawala, M.: Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851 (2025)
[46] Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8652–8661 (2023)
[47] Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3661–3670 (2021)
[48] Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023)
[49] Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., Li, D.: Makeittalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG) 39(6), 1–15 (2020)

[1] [1]

Ai flow: Perspectives, scenarios, and approaches,

An, H., Hu, W., Huang, S., Huang, S., Li, R., Liang, Y., Shao, J., Song, Y., Wang, Z., Yuan, C., et al.: Ai flow: Perspectives, scenarios, and approaches (2025). arXiv preprint arXiv:2506.12479 (2025)

work page arXiv 2025

[2] [2]

arXiv preprint arXiv:2505.20156 (2025)

Chen, Y., Liang, S., Zhou, Z., Huang, Z., Ma, Y., Tang, J., Lin, Q., Zhou, Y., Lu, Q.: Hunyuanvideo-avatar: High-fidelity audio-driven human animation for multiple characters. arXiv preprint arXiv:2505.20156 (2025)

work page arXiv 2025

[3] [3]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Chen, Z., Cao, J., Chen, Z., Li, Y., Ma, C.: Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 2403–2410 (2025)

2025

[4] [4]

Cheng, S., Zhang, J., Song, Q., Liu, S., Guo, Z., Zhang, X., Zhang, C., Li, X., Tu, Z.: Unison: Harmonizing motion, speech, and sound for human-centric audio-video generation (2026),https://arxiv.org/abs/2605.08729

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

In: INTERSPEECH (2018)

Chung, J.S., Nagrani, A., Zisserman, A.: Voxceleb2: Deep speaker recognition. In: INTERSPEECH (2018)

2018

[6] [6]

In: Asian conference on computer vision

Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Asian conference on computer vision. pp. 251–263. Springer (2016)

2016

[7] [7]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Cui, J., Li, H., Zhan, Y., Shang, H., Cheng, K., Ma, Y., Mu, S., Zhou, H., Wang, J., Zhu, S.: Hallo3: Highly dynamic and realistic portrait image animation with video diffusion transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21086–21095 (2025)

2025

[8] [8]

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Cui, J., Wu, J., Li, M., Yang, T., Li, X., Wang, R., Bai, A., Ban, Y., Hsieh, C.J.: Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

arXiv preprint arXiv:2505.10238 (2025)

Ding, Y., Hu, X., Guo, Z., Zhang, C., Wang, Y.: Mtvcrafter: 4d motion tokenization for open-world human image animation. arXiv preprint arXiv:2505.10238 (2025)

work page arXiv 2025

[10] Gan, Q., Yang, R., Zhu, J., Xue, S., Hoi, S.: Omniavatar: Efficient audio-driven avatar video generation with adaptive body animation. arXiv preprint arXiv:2506.18866 (2025)

[11] Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., Malik, J.: Learning individual styles of conversational gesture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3497–3506 (2019)

[12] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)

[13] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)

[14] Huang, Y., Guo, H., Wu, F., Zhang, S., Huang, S., Gan, Q., Liu, L., Zhao, S., Chen, E., Liu, J., et al.: Live avatar: Streaming real-time audio-driven avatar generation with infinite length. arXiv preprint arXiv:2512.04677 (2025)

[15] Jiang, W., Zhang, Y., Zheng, S., Liu, S., Yan, S.: Data augmentation in human-centric vision. Vicinagearth 1(1), 8 (2024)

[16] Kong, Z., Gao, F., Zhang, Y., Kang, Z., Wei, X., Cai, X., Chen, G., Luo, W.: Let them talk: Audio-driven multi-person conversational video generation. arXiv preprint arXiv:2505.22647 (2025)

[17] Li, H., Xu, M., Zhan, Y., Mu, S., Li, J., Cheng, K., Chen, Y., Chen, T., Ye, M., Wang, J., et al.: Openhumanvid: A large-scale high-quality dataset for enhancing human-centric video generation. arXiv preprint arXiv:2412.00115 (2024)

[18] Li, X., Wang, S., Zeng, S., Wu, Y., Yang, Y.: A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges. Vicinagearth 1(1), 9 (2024)

[19] Li, X.: Positive-incentive noise. IEEE Transactions on Neural Networks and Learning Systems 35(6), 8708–8714 (2022)

[20] Low, C., Wang, W.: Talkingmachines: Real-time audio-driven facetime-style video via autoregressive diffusion models. arXiv preprint arXiv:2506.03099 (2025)

[21] Meng, R., Wang, Y., Wu, W., Zheng, R., Li, Y., Ma, C.: Echomimicv3: 1.3b parameters are all you need for unified multi-modal and multi-task human animation. arXiv preprint arXiv:2507.03905 (2025)

[22] Ng, E., Joo, H., Hu, L., Li, H., Darrell, T., Kanazawa, A., Ginosar, S.: Learning to listen: Modeling non-deterministic dyadic facial motion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20395–20405 (2022)

[23] Prajwal, K., Mukhopadhyay, R., Namboodiri, V.P., Jawahar, C.: A lip sync expert is all you need for speech to lip generation in the wild. In: Proceedings of the 28th ACM international conference on multimedia. pp. 484–492 (2020)

[24] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), 1–67 (2020)

[25] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025), https://ar...

[26] Sun, Z., Peng, Z., Ma, Y., Chen, Y., Zhou, Z., Zhou, Z., Zhang, G., Zhang, Y., Zhou, Y., Lu, Q., et al.: Streamavatar: Streaming diffusion models for real-time interactive human avatars. arXiv preprint arXiv:2512.22065 (2025)

[27] Tian, L., Wang, Q., Zhang, B., Bo, L.: Emo: Emote portrait alive generating expressive portrait videos with audio2video diffusion model under weak conditions. In: European Conference on Computer Vision. pp. 244–260. Springer (2024)

[28] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al.: Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025)

[29] Tu, S., Pan, Y., Huang, Y., Han, X., Xing, Z., Dai, Q., Luo, C., Wu, Z., Jiang, Y.G.: Stableavatar: Infinite-length audio-driven avatar video generation (2025), https://arxiv.org/abs/2508.08248

[30] Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)

[31] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

[32] Wang, J., Wang, C., Huang, K., Huang, J., Jin, L.: Videoclip-xl: Advancing long description understanding for video clip models (2024), https://arxiv.org/abs/2410.00741

[33] Wang, L., Zhu, Y., Ge, Z., Zheng, Y., Zhang, L., Hu, T., Qin, S., Luo, M., Zhang, J., Chen, X., et al.: Flowact-r1: Towards interactive humanoid video generation. arXiv preprint arXiv:2601.10103 (2026)

[34] Wang, M., Wang, Q., Jiang, F., Fan, Y., Zhang, Y., Qi, Y., Zhao, K., Xu, M.: Fantasytalking: Realistic talking portrait generation via coherent motion synthesis. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 9891–9900 (2025)

[35] Wang, Y., Fan, Y., Wang, X., Yu, G., Wang, F.: Diffusion-based realistic listening head generation via hybrid motion modeling. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15885–15895 (2025)

[36] Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., et al.: Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)

[37] Xie, L., Wang, X., Zhang, H., Dong, C., Shan, Y.: Vfhq: A high-quality dataset and benchmark for video face super-resolution. In: The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2022)

[38] Xie, Y., Gu, T., Li, Z., Zhang, C., Song, G., Zhao, X., Liang, C., Jiang, J., Xu, H., Luo, L.: X-streamer: Unified human world modeling with audiovisual interaction. arXiv preprint arXiv:2509.21574 (2025)

[39] Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025)

[40] Xu, S., Chen, G., Guo, Y.X., Yang, J., Li, C., Zang, Z., Zhang, Y., Tong, X., Guo, B.: Vasa-1: Lifelike audio-driven talking faces generated in real time. Advances in Neural Information Processing Systems 37, 660–684 (2024)

[41] Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)

[42] Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22963–22974 (2025)

[43] Yu, J., Zhu, H., Jiang, L., Loy, C.C., Cai, W., Wu, W.: CelebV-Text: A large-scale facial text-video dataset. In: CVPR (2023)

[44] Zhang, H., Huang, S., Guo, Y., Li, X.: Variational positive-incentive noise: How noise benefits models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

[45] Zhang, L., Cai, S., Li, M., Zeng, C., Lu, B., Rao, A., Han, S., Wetzstein, G., Agrawala, M.: Pretraining frame preservation in autoregressive video memory compression. arXiv preprint arXiv:2512.23851 (2025)

[46] Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., Shan, Y., Wang, F.: Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8652–8661 (2023)

[47] Zhang, Z., Li, L., Ding, Y., Fan, C.: Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3661–3670 (2021)

[48] Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., et al.: Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023)

[49] Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., Li, D.: Makeittalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG) 39(6), 1–15 (2020)