A recipe for scaling up text-to-video generation with text-free videos

Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang · arXiv 2508.09959

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

representative citing papers

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

THEval. Evaluation Framework for Talking Head Video Generation

cs.CV · 2025-11-06 · conditional · novelty 6.0

THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.

citing papers explorer

Showing 2 of 2 citing papers.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling cs.CV · 2026-04-26 · unverdicted · none · ref 24
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
THEval. Evaluation Framework for Talking Head Video Generation cs.CV · 2025-11-06 · conditional · none · ref 7
THEval proposes eight metrics for evaluating talking head videos on quality, naturalness, and synchronization, tested on 85,000 videos from 17 models with a new curated dataset.

A recipe for scaling up text-to-video generation with text-free videos

fields

years

verdicts

representative citing papers

citing papers explorer