EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

Bohong Chen; Kun Zhou; Yanlin Weng; Yinglin Xu; Youyi Zheng; Yumeng Li

arxiv: 2605.28272 · v1 · pith:TFJBKYUKnew · submitted 2026-05-27 · 💻 cs.CV

EchoAvatar: Real-time Generative Avatar Animation from Audio Streams

Bohong Chen , Yumeng Li , Yinglin Xu , Youyi Zheng , Yanlin Weng , Kun Zhou This is my paper

Pith reviewed 2026-06-29 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords real-time avatar animationaudio-driven motion synthesisstreaming generative modelfull-body animationspeech and music generalizationreinforcement learningLLM tool integrationinteractive virtual avatars

0 comments

The pith

A single streaming model generates continuous full-body avatar motion from live audio streams of speech or music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that synthesizes coherent 3D character animations in real time from ongoing audio inputs. It addresses the restriction of prior methods to complete offline sequences or single domains by using one architecture that processes audio incrementally. A sympathetic reader would care because this enables responsive virtual avatars that can work with voice agents or live music without separate models or manual switches. The work combines the streaming design with reinforcement learning for smoother output and an interface for large language models to add semantic instructions.

Core claim

The central claim is that a unified streaming architecture, supported by a robust training strategy enforcing strong audio dependency, produces continuous and coherent full-body motion from incremental audio inputs and generalizes across conversational speech and rhythmic music without explicit domain labels or mode switching, while reinforcement learning improves online quality and a tool-call interface adds controllability from upstream language models.

What carries the argument

The unified streaming architecture that converts incremental audio inputs into continuous coherent full-body motion.

If this is right

The system can act as a plug-and-play component that turns voice agents into interactive humanoid avatars.
It produces higher motion quality and better audio synchronization than existing real-time methods.
Reinforcement learning can be applied to improve the quality of the online motion generation process.
The tool-call interface lets large language models supply explicit semantic control on top of the audio-driven output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same architecture might accept mixed audio such as speech layered over background music without retraining.
It could combine with other sensor streams like video or motion capture to create multi-modal avatar control.
Deployment on edge devices might allow low-latency avatar animation in consumer virtual-reality applications.

Load-bearing premise

The training strategy that enforces strong audio dependency is enough for the model to handle both speech and music seamlessly without any domain labels or switches.

What would settle it

Feeding the model a single audio stream that switches midway from speech to music and checking whether the generated body motion remains coherent and synchronized or breaks into mismatches and artifacts.

Figures

Figures reproduced from arXiv: 2605.28272 by Bohong Chen, Kun Zhou, Yanlin Weng, Yinglin Xu, Youyi Zheng, Yumeng Li.

**Figure 1.** Figure 1: Given streaming audio input, our method generates avatar animation in a streaming manner. The four poses shown above are sampled from a [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: The structure of our motion generation model. Our model is capa [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Architecture of our Attention-based Causal Motion Tokenizer with [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: O/G denotes Gesture Only, which training exclusively on the speech-gesture dataset. As shown, our model trained jointly on both speech-gesture and [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: W/o C denotes training without Hierarchical Token Corruption. Given the same audio and initial motion input, our method generates natural motions [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Real-time Deployment. Our system comprises three components: the user host machine, a voice agent, and the Motion Generator. The host machine [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Motion Quality Reward Model Evaluation. The four plots demonstrate the performance of our motion quality reward model on the validation set under [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Screenshot of the data collection interface used for DPO training. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Screenshot of the web interface used for the user study. [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

read the original abstract

Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explored Reinforcement Learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with stream audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art realtime baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/EchoAvatar-Page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a unified streaming model for audio-driven full-body animation across speech and music without domain labels, but the key training strategy remains undescribed.

read the letter

The punchline is that this paper describes a streaming system for real-time full-body avatar animation from audio that tries to handle both speech and music in one go without domain labels, using RL and an LLM hook.

What stands out as new is the combination of incremental audio processing with cross-domain generalization and the tool-call interface for LLM control. Releasing code and models is a plus for anyone wanting to build on it.

The paper does well at positioning the work as a practical plug-and-play for turning voice agents into avatars.

The soft spot is the generalization. The claim that a single model works seamlessly across conversational speech and rhythmic music without labels or switching rests on an asserted "robust training strategy that enforces strong audio dependency." No mechanism, loss function, or experiment is described in the abstract to show how it avoids using domain-specific audio features as shortcuts. The stress-test note captures this exactly. If the full paper has ablations or derivations that address it, that would change things, but as presented the central result looks more like an assumption than a demonstrated outcome.

The experiments are said to show outperformance, but without data details it's hard to judge.

This paper is for graphics and HCI researchers focused on real-time animation systems. A reader looking for ideas on streaming generative models or LLM-integrated animation would find it worth a look.

It deserves serious referee time because the application area is active and the architecture is concrete enough to review, even with the gaps in the training description.

Referee Report

2 major / 1 minor

Summary. The paper introduces EchoAvatar, a framework for real-time generative avatar animation from streaming audio. It proposes a unified streaming architecture that synthesizes continuous full-body motion from incremental audio inputs, a robust training strategy claimed to enforce strong audio dependency for seamless generalization across conversational speech and rhythmic music without domain labels or mode switching, reinforcement learning to refine online generation quality, and a tool-call interface enabling upstream LLMs to inject semantic control. The work claims to outperform state-of-the-art real-time baselines in motion quality and synchronization while supporting live deployment, with code and models released.

Significance. If the generalization mechanism and real-time performance claims hold under rigorous validation, the work would advance interactive avatar systems by offering a single model for cross-domain audio-driven animation with LLM controllability, addressing a practical gap in voice-agent-to-avatar pipelines.

major comments (2)

[Abstract] Abstract: the central generalization claim (seamless cross-domain performance without explicit labels or mode switching) rests entirely on the assertion of a 'robust training strategy that enforces strong audio dependency.' No loss formulation, data-mixing protocol, feature-ablation term, or curriculum is supplied, so the property is an untested assumption rather than a derived result; this is load-bearing for the unified-architecture contribution.
[Abstract] Abstract: the claim of outperformance over SOTA real-time baselines in motion quality and synchronization is stated without any quantitative metrics, dataset details, ablation results, or baseline descriptions, preventing verification of the experimental support for the core claims.

minor comments (1)

The provision of code, pre-trained models, and videos at the project page is a positive step toward reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity and support for the claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central generalization claim (seamless cross-domain performance without explicit labels or mode switching) rests entirely on the assertion of a 'robust training strategy that enforces strong audio dependency.' No loss formulation, data-mixing protocol, feature-ablation term, or curriculum is supplied, so the property is an untested assumption rather than a derived result; this is load-bearing for the unified-architecture contribution.

Authors: We agree that the abstract would benefit from greater self-containment on this point. The loss formulation, data-mixing protocol, and related training details are presented in Section 3.2, with supporting ablations in Section 4.3. We will revise the abstract to include a concise description of the key training components that enforce audio dependency. revision: yes
Referee: [Abstract] Abstract: the claim of outperformance over SOTA real-time baselines in motion quality and synchronization is stated without any quantitative metrics, dataset details, ablation results, or baseline descriptions, preventing verification of the experimental support for the core claims.

Authors: We acknowledge the observation. The quantitative metrics, dataset details, baseline descriptions, and ablation results are reported in Section 4 and Tables 1–3. We will revise the abstract to incorporate key numerical results and evaluation details to make the performance claims more verifiable from the abstract alone. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on architectural description and empirical results rather than self-referential derivations

full rationale

The paper describes a streaming architecture and a 'robust training strategy that enforces strong audio dependency' for cross-domain generalization, but supplies no equations, fitted parameters, or derivation chain that reduces to its own inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the abstract or described claims. The central assertions are presented as design choices validated by experiments, not as mathematical results forced by prior steps within the paper itself. This is the expected non-finding for an applied CV systems paper without visible analytic derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5770 in / 954 out tokens · 35398 ms · 2026-06-29T13:38:07.338814+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Qwen2.5 Technical Report

Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115(2024). Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, et al. 2025. GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generatio...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3581783.3612503 2024
[2]

Insights into deep non-linear filters for improved multi-channel speech enhancement,

SoundStream: An End-to-End Neural Audio Codec.IEEE/ACM Transactions on Audio, Speech, and Language Processing30 (2022), 495–507. doi:10.1109/TASLP. 2021.3129994 Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, and Ehsan Adeli. 2026b. ViBES: A Conver- sational Agent with Behaviorally-Intelligent 3...

work page doi:10.1109/taslp 2022
[3]

Gamify Instructions: Describe *why* you are doing an action
[4]

Music Bridge

The "Music Bridge": If the user mentions music, treat it as the climax. # Tools When the user asks to play music: - Use the play_music tool - If they mention a song name or keyword, pass that as the title When the user asks to stop music: - Use the stop_music tool When the user asks for an action or gesture: - Use the send_action tool - Available actions:...

2026

[1] [1]

Qwen2.5 Technical Report

Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115(2024). Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, et al. 2025. GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generatio...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3581783.3612503 2024

[2] [2]

Insights into deep non-linear filters for improved multi-channel speech enhancement,

SoundStream: An End-to-End Neural Audio Codec.IEEE/ACM Transactions on Audio, Speech, and Language Processing30 (2022), 495–507. doi:10.1109/TASLP. 2021.3129994 Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, and Ehsan Adeli. 2026b. ViBES: A Conver- sational Agent with Behaviorally-Intelligent 3...

work page doi:10.1109/taslp 2022

[3] [3]

Gamify Instructions: Describe *why* you are doing an action

[4] [4]

Music Bridge

The "Music Bridge": If the user mentions music, treat it as the climax. # Tools When the user asks to play music: - Use the play_music tool - If they mention a song name or keyword, pass that as the title When the user asks to stop music: - Use the stop_music tool When the user asks for an action or gesture: - Use the send_action tool - Available actions:...

2026