Loki replaces RGB conditioning stacks with identity-orthogonal parametric face encodings rasterized for diffusion, achieving efficient cross-ID portrait animation without cross-ID training data.
hub
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation
25 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
A dual-path modulation technique injects independent emotion control into existing feed-forward single-image 3D head avatar pipelines while preserving reconstruction quality.
A replay method for continual face forgery detection condenses real-fake distribution discrepancies into compact maps and synthesizes compatible samples from current real faces to reduce forgetting under tight memory budgets without storing historical images.
Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.
AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
SyncCache accelerates DiT-based audio-driven portrait animation up to 4.12x via spatially-asymmetric probing and modality-decoupled caching while preserving near-lossless quality and audio sync.
A causal VAE with variable reference guidance and a Rectified Flow Transformer enables real-time streamable high-quality talking portrait video generation from audio and images.
Reformulates evaluation of audio-driven talking head generation as a sequence alignment problem using Soft DTW, showing improved robustness and consistency across 20 methods and seven datasets.
A fine-tuning-free framework combines pretrained Stable Diffusion with IP-Adapter plus three parameter-free modules to achieve improved lip synchronization and visual quality in talking face generation.
TT-SAC is a parameter-free inference framework that uses a generator-encoder feedback loop to adapt conditioning representations and stabilize identity and motion in audio-driven talking-head videos.
Introduces Eulerian motion guidance with bidirectional geometric consistency to improve training speed and temporal quality in diffusion-based image animation.
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
TAVR generates high-fidelity talking avatars from cross-scene video references via token selection and three-stage training (same-scene pretraining, cross-scene fine-tuning, identity RL), outperforming baselines on a new 158-pair benchmark.
MeshLAM reconstructs high-fidelity animatable textured mesh head avatars from a single image via a feed-forward dual shape-texture architecture with iterative GRU decoding and reprojection-based guidance.
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
HighSync is a diffusion-based lip synchronization system that operates natively at 512x512 resolution by eliminating data leakage to enforce genuine audio dependence and reports state-of-the-art results on quality and sync metrics.
Pixel-level protective perturbations for portrait privacy are ineffective against common image transformations, and a low-cost purification framework can strip them out.
TurboTalk uses progressive distillation from 4 steps to 1 step with distribution matching and adversarial training to achieve 120x faster single-step audio-driven talking avatar video generation.
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.
citing papers explorer
-
Beyond Monologue: Interactive Talking-Listening Avatar Generation with Conversational Audio Context-Aware Kernels
Multi-head Gaussian kernels inject temporal scale discrepancy as inductive bias to enable full-duplex talking-listening avatar generation, supported by a new decoupled VoxHear dataset and claimed SOTA naturalness.