Wan-S2V: Audio-Driven Cinematic Video Generation

arXiv 2508.18621 · 2025

Canonical reference. 19 Pith papers cite this work; 100% of classified citations cite it as background.

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

Introduces CIPE-Dance, described as the largest dance video dataset, and the OmniDance framework for unified text- and music-driven dance video generation, reporting state-of-the-art results on the TI2V, MI2V, and MTI2V tasks.

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

cs.CV · 2026-06-11 · unverdicted · novelty 7.0

ReFree-S2V applies multilevel speech guidance and reward-free reinforcement learning inside a flow-matching model built on a pretrained video generator to improve lip synchronization and natural expressivity in talking-head videos.

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

ASTRA: Let Arbitrary Subjects Transform in Video Editing

cs.CV · 2025-10-01 · unverdicted · novelty 7.0

ASTRA is a plug-and-play training-free method for precise multi-subject video editing that uses prompt-guided multimodal alignment and prior-based mask retargeting to avoid attention dilution and boundary issues.

Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

Ink3D decouples geometry from texture by generating dense orbit videos with a conditional video model and baking them via a neural optimizer to produce complex 3D textures.

SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

SyncCache accelerates DiT-based audio-driven portrait animation up to 4.12x via spatially-asymmetric probing and modality-decoupled caching while preserving near-lossless quality and audio sync.

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.

VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VidSplat iteratively synthesizes novel views with geometry-guided video diffusion to enable robust Gaussian splatting reconstruction from sparse or single-image inputs.

Generate Your Talking Avatar from Video Reference

cs.CV · 2026-04-30 · unverdicted · novelty 6.0

TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

cs.CV · 2026-04-28 · unverdicted · novelty 6.0

Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.

LPM 1.0: Video-based Character Performance Model

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.

Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

cs.CV · 2025-12-04 · conditional · novelty 6.0

Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

cs.CV · 2026-02-10 · unverdicted · novelty 5.0

AUHead uses audio-language models to generate Action Unit sequences from speech and feeds them into a controllable diffusion model to synthesize realistic emotional talking-head videos.

LTX-2: Efficient Joint Audio-Visual Foundation Model

cs.CV · 2026-01-06 · conditional · novelty 5.0

LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

cs.CV · 2026-02-14 · unverdicted · novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.

LongCat-Video-Avatar 1.5 Technical Report

cs.CV · 2026-05-26 · unverdicted · novelty 3.0

LongCat-Video-Avatar 1.5 delivers an engineering-focused upgrade to audio-driven video generation with claimed competitive performance against closed-source systems on a 500-case benchmark.

citing papers explorer

OmniDance: Multimodal Driven Dance Video Generation with Large-scale Internet Data · cs.CV · 2026-06-29 · unverdicted · none · ref 12
ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance · cs.CV · 2026-06-11 · unverdicted · none · ref 4
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields · cs.CV · 2026-05-07 · unverdicted · none · ref 5
ASTRA: Let Arbitrary Subjects Transform in Video Editing · cs.CV · 2025-10-01 · unverdicted · none · ref 7
Ink3D: Sculpting 3D Assets with Extremely Complex Textures via Video Generative Models · cs.CV · 2026-07-01 · unverdicted · none · ref 20
SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation · cs.CV · 2026-06-29 · unverdicted · none · ref 12
StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration · cs.CV · 2026-05-25 · unverdicted · none · ref 8
VidSplat: Gaussian Splatting Reconstruction with Geometry-Guided Video Diffusion Priors · cs.CV · 2026-05-12 · unverdicted · none · ref 7
Generate Your Talking Avatar from Video Reference · cs.CV · 2026-04-30 · unverdicted · none · ref 18
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation · cs.CV · 2026-04-28 · unverdicted · none · ref 13
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation · cs.CV · 2026-04-21 · unverdicted · none · ref 10
LPM 1.0: Video-based Character Performance Model · cs.CV · 2026-04-09 · unverdicted · none · ref 33
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length · cs.CV · 2025-12-04 · conditional · none · ref 15
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation · cs.CV · 2026-05-17 · unverdicted · none · ref 16
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence · cs.CV · 2026-04-10 · unverdicted · none · ref 14
AUHead: Realistic Emotional Talking Head Generation via Action Units Control · cs.CV · 2026-02-10 · unverdicted · none · ref 6
LTX-2: Efficient Joint Audio-Visual Foundation Model · cs.CV · 2026-01-06 · conditional · none · ref 7
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation · cs.CV · 2026-02-14 · unverdicted · none · ref 65
LongCat-Video-Avatar 1.5 Technical Report · cs.CV · 2026-05-26 · unverdicted · none · ref 4
