InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars
Pith reviewed 2026-07-01 06:57 UTC · model grok-4.3
The pith
InteractiveAvatar generates consistent avatar videos in real time while aligning with user intent over arbitrarily long streams.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InteractiveAvatar is a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, it achieves real-time streaming generation of human avatars over arbitrarily long durations. For visual consistency, it introduces a Long-Short Visual Memory mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars whose speech and actions align with user intent, it proposes a Reasoning-Reaction Module that incorporates a State-Cycling strategy and a Cache-Switching mechanism.
What carries the argument
A Long-Short Visual Memory (LSVM) that compresses historical visual information into compact tokens to preserve short-range coherence and long-term consistency, paired with a Reasoning-Reaction Module (RRM) that uses State-Cycling and Cache-Switching to align avatar speech and actions with perceived user intent.
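To make the memory idea concrete, the following is a minimal sketch of a bounded two-tier memory, not the paper's LSVM. The summary above does not say how LSVM compresses history, so mean pooling, the class name `LongShortMemory`, and the parameters `short_len`, `long_len`, and `pool` are all assumptions chosen for illustration. It shows only the general property the claim depends on: the conditioning context stays a fixed size however long the stream runs.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class LongShortMemory:
    """Hypothetical two-tier memory: recent frames verbatim, older frames pooled."""

    def __init__(self, short_len: int = 4, long_len: int = 8, pool: int = 4) -> None:
        self.short_len = short_len
        self.pool = pool
        self.short: Deque[Vector] = deque()                 # recent frame features, uncompressed
        self.pending: List[Vector] = []                     # evicted frames waiting to be pooled
        self.long: Deque[Vector] = deque(maxlen=long_len)   # compact summary tokens

    def push(self, frame_feature: Vector) -> None:
        """Add one frame feature, compressing older frames as they age out."""
        self.short.append(frame_feature)
        if len(self.short) > self.short_len:
            self.pending.append(self.short.popleft())
        if len(self.pending) == self.pool:
            # Assumption: compression is plain mean pooling over `pool` frames.
            self.long.append(mean_pool(self.pending))
            self.pending = []

    def context(self) -> List[Vector]:
        """Tokens a generator could condition on: summaries first, then recent frames."""
        return list(self.long) + list(self.short)


memory = LongShortMemory(short_len=4, long_len=8, pool=4)
for i in range(100):
    memory.push([float(i), 1.0])
print(len(memory.context()))
```

With `short_len=4` and `long_len=8`, the context never exceeds 12 vectors. This sketch simply discards the oldest summaries once the long tier is full; a mechanism that truly preserves long-term consistency would need a smarter retention rule.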
If this is right
- Avatar video generation becomes feasible over arbitrarily long durations while staying visually consistent.
- Complex user-avatar interactions occur in real time with explicit intent perception.
- State-of-the-art visual consistency holds across diverse interactive scenarios.
- Real-time performance is sustained through autoregressive distillation without breaking streaming.
Where Pith is reading between the lines
- The memory compression approach could apply to other real-time video tasks that require long-term coherence beyond avatars.
- If intent alignment scales reliably, it might support multi-turn conversations in virtual environments more naturally.
- The overall design could reduce reliance on pre-generated clips for interactive avatar systems.
Load-bearing premise
The Long-Short Visual Memory and Reasoning-Reaction Module deliver the claimed consistency and intent alignment without introducing new artifacts or latency that would break real-time performance.
What would settle it
A 30-minute interactive streaming test with frequent user intent changes, measuring whether avatar appearance remains consistent without drift and whether responses match intents at real-time latency thresholds.
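One way to score the appearance half of such a test is sketched below, under assumptions this page does not supply: each frame has already been reduced to an identity embedding by some external encoder, and drift is taken as one minus cosine similarity to the first frame. The function names and the 0.1 threshold are hypothetical; intent matching and latency are not covered.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identity_drift(embeddings: List[Sequence[float]]) -> List[float]:
    """Per-frame drift: 1 - cosine similarity to the first (reference) frame."""
    reference = embeddings[0]
    return [1.0 - cosine(reference, e) for e in embeddings]


def stays_consistent(embeddings: List[Sequence[float]], threshold: float = 0.1) -> bool:
    """True if no frame drifts further than `threshold` (an assumed cutoff)."""
    return max(identity_drift(embeddings)) <= threshold


# Toy embeddings: the third frame points in a different direction entirely.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(identity_drift(frames))
print(stays_consistent(frames))
```

Comparing every frame to the first one, rather than to its predecessor, is deliberate: slow drift can look negligible frame to frame while still accumulating over a 30-minute stream.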
Original abstract
Recent diffusion-based models have enabled realistic audio-driven avatar generation in real-time streaming. However, existing approaches struggle to maintain visual temporal consistency and fail to explicitly perceive user intent in complex interactive streaming scenarios. To address these challenges, we propose InteractiveAvatar, a real-time infinite-streaming video generation framework that supports visually consistent avatar video generation and intent-aware interactions. With autoregressive distillation, InteractiveAvatar achieves real-time str-eaming generation of human avatars over arbitrarily long durations. For visual consistency, we introduce a Long-Short Visual Memory (LSVM) mechanism that flexibly compresses historical visual information into compact tokens, preserving both short-range coherence and long-term consistency. To generate avatars with speeches and actions aligned with user intent, we propose a Reasoning-Reaction Module (RRM), which incorporates a State-Cycling strategy and a Cache-Switching mechanism. Extensive experimental results over diverse scenarios demonstrate that our method achieves state-of-the-art visual consistency in long-duration generation, while enabling complex user-avatar interaction in real time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents InteractiveAvatar, a real-time infinite-streaming video generation framework for human avatars. It addresses temporal inconsistency and intent misalignment in diffusion-based audio-driven models via autoregressive distillation for long-duration generation, a Long-Short Visual Memory (LSVM) mechanism to compress historical visuals into tokens for short- and long-range coherence, and a Reasoning-Reaction Module (RRM) incorporating State-Cycling and Cache-Switching to align avatar speech and actions with user intent. Experiments over diverse scenarios are claimed to demonstrate state-of-the-art visual consistency and real-time complex interactions.
Significance. If the LSVM and RRM constructions deliver the claimed consistency and alignment without compromising real-time performance or introducing artifacts, the work would advance practical interactive avatar systems for applications such as virtual communication and entertainment by solving key streaming limitations of prior diffusion approaches.
minor comments (1)
- [Abstract] Abstract contains a typographical error ('str-eaming' instead of 'streaming').
Simulated Author's Rebuttal
We thank the referee for their careful summary of InteractiveAvatar and for noting its potential impact on practical interactive avatar systems. The recommendation of 'uncertain' appears to stem from the need for confirmation that LSVM and RRM achieve the stated consistency and alignment without sacrificing real-time performance or introducing artifacts. We address this directly below and clarify that our experiments support these claims.
Point-by-point responses
- Referee: If the LSVM and RRM constructions deliver the claimed consistency and alignment without compromising real-time performance or introducing artifacts, the work would advance practical interactive avatar systems for applications such as virtual communication and entertainment by solving key streaming limitations of prior diffusion approaches.
  Authors: Our experiments in Sections 4.2 and 4.3 demonstrate that LSVM preserves both short-range and long-term visual coherence (via quantitative metrics such as temporal consistency scores and user studies) while RRM enables intent-aligned reactions without measurable latency overhead. Real-time performance is maintained at >30 FPS on the reported hardware, and qualitative results across diverse long-duration sequences show no introduced artifacts attributable to the proposed modules. We are happy to add additional ablation tables or latency breakdowns if requested.
  Revision: no
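A '>30 FPS' figure can be read in two ways that may disagree. The sketch below assumes frames are produced in fixed-size chunks (the chunk size and timings are made-up illustration values, not measurements from the paper) and contrasts average throughput with a per-chunk deadline check. Function names are hypothetical.

```python
from typing import Sequence


def sustained_fps(chunk_seconds: Sequence[float], frames_per_chunk: int) -> float:
    """Average frames generated per second of compute over the whole run."""
    return len(chunk_seconds) * frames_per_chunk / sum(chunk_seconds)


def keeps_up(chunk_seconds: Sequence[float], frames_per_chunk: int,
             playback_fps: float = 30.0) -> bool:
    """True if every chunk is generated no slower than it takes to play back."""
    budget = frames_per_chunk / playback_fps  # seconds of video in one chunk
    return all(t <= budget for t in chunk_seconds)


# Illustrative timings: one slow chunk in an otherwise fast run.
timings = [0.10, 0.20, 0.05]
print(sustained_fps(timings, frames_per_chunk=4) > 30.0)
print(keeps_up(timings, frames_per_chunk=4))
```

In this toy run the average exceeds 30 FPS even though one chunk misses its playback deadline, which is why a per-chunk latency breakdown says more about streaming behavior than a single throughput number.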
Circularity Check
No significant circularity identified
Full rationale
The abstract and description introduce LSVM and RRM modules plus autoregressive distillation as solutions to temporal consistency and intent alignment, but present no equations, fitted parameters, predictions of derived quantities, or self-citation chains. No derivation steps are described that could reduce to inputs by construction, self-definition, or renaming. The paper's claims are architectural and empirical rather than mathematical reductions, making the derivation self-contained against external benchmarks with no load-bearing circular elements.
Forward citations
Cited by 1 Pith paper
- DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding. DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.