Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
A recipe for scaling up text-to-video generation with text-free videos
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2representative citing papers
THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.
citing papers explorer
-
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
-
THEval. Evaluation Framework for Talking Head Video Generation
THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.