In- terActHuman: Multi-concept human animation with layout- aligned audio conditions

Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Dahua Lin · 2025 · arXiv 2506.09984

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

cs.CV · 2026-01-15 · unverdicted · novelty 7.0

CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.

AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

cs.CV · 2026-02-14 · unverdicted · novelty 4.0

EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.

citing papers explorer

Showing 4 of 4 citing papers.

CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos cs.CV · 2026-01-15 · unverdicted · none · ref 95
CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.
AVI-Edit: Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner cs.CV · 2025-12-11 · unverdicted · none · ref 63
AVI-Edit enables precise audio-synchronized instance-level video editing via a granularity-aware mask refiner, a self-feedback audio agent, and a new large-scale annotated dataset.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 59
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation cs.CV · 2026-02-14 · unverdicted · none · ref 71
EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consistency and audio-lip sync.

In- terActHuman: Multi-concept human animation with layout- aligned audio conditions

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer