HunyuanVideo: A Systematic Framework For Large Video Generative Models

Aladdin Wang; Andong Wang; Bo Wu; Caesar Zhong (refer to the report for detailed contributions); Changlin Li; Dax Zhou; Di Wang; Duojun Huang; Fang Yang; Hao Tan

arxiv: 2412.03603 · v6 · submitted 2024-12-03 · 💻 cs.CV

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Caesar Zhong (refer to the report for detailed contributions)

Pith reviewed 2026-05-23 07:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords video generation · open-source model · generative AI · text-to-video · large-scale training · foundation model · AI video synthesis

The pith

HunyuanVideo is an open-source video generation model with over 13 billion parameters that performs on par with or better than leading closed-source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors introduce HunyuanVideo, a comprehensive open-source framework for building large video generative models. This framework covers data curation, model architecture, progressive scaling up to more than 13 billion parameters, and specialized infrastructure for training and inference. Through targeted designs, the model achieves high performance in visual quality, motion dynamics, text-video alignment, and filming techniques. Professional evaluations show it outperforming models such as Runway Gen-3 and Luma 1.6. By making the code public, the work seeks to narrow the gap between closed-source industry leaders and open-source capabilities.

Core claim

We trained a video generative model with over 13 billion parameters using a systematic framework that includes data curation, advanced architectural design, progressive model scaling and training, and efficient infrastructure, resulting in performance that matches or exceeds that of leading closed-source models according to professional evaluations.

What carries the argument

The comprehensive framework integrating data curation, advanced architectural design, progressive model scaling and training, and efficient infrastructure for large-scale model training and inference.

If this is right

Open access to a high-performing video model allows researchers and developers to experiment and innovate without proprietary restrictions.
The release enables community-driven improvements and applications in areas like content creation and simulation.
It establishes a baseline for future open-source video models to build upon in terms of scale and quality.
Bridging the performance gap encourages more collaborative development in the video generation field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the model generalizes well, it could accelerate adoption of video AI in smaller organizations and individual creators.
Extensions might involve combining this with other AI tools for end-to-end video production pipelines.
Testing on more diverse and challenging prompts could reveal specific strengths and limitations not covered in the initial evaluations.

Load-bearing premise

The professional evaluations used consistent, unbiased protocols with comparable generation settings across all compared models.

What would settle it

A controlled experiment with identical prompts and inference parameters where independent raters find no performance advantage for HunyuanVideo over the compared models.
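
One way such a head-to-head comparison could be scored is an exact two-sided sign test on paired preferences. The sketch below is illustrative only and is not the evaluation protocol of the paper: the function names, the labels `"A"`, `"B"`, and `"tie"`, and the choice of a sign test are all assumptions. It also treats every (prompt, rater) judgment as independent, which a real study would need to justify. A large p-value means no advantage was detected, not that none exists.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value. Ties must be dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize(preferences: list[str]) -> dict:
    """Summarize paired judgments.

    Each entry is one (prompt, rater) verdict: "A" if the first model was
    preferred, "B" if the second was, "tie" otherwise (hypothetical labels).
    """
    wins = preferences.count("A")
    losses = preferences.count("B")
    ties = preferences.count("tie")
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "p_value": sign_test_p_value(wins, losses),
    }
```

Both models would have to be run on the same prompts with matched inference settings before raters see the outputs; the test only summarizes the resulting verdicts.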

Figures

Figures reproduced from arXiv: 2412.03603 by Aladdin Wang, Andong Wang, Bo Wu, Caesar Zhong (refer to the report for detailed contributions), Changlin Li, Dax Zhou, Di Wang, Duojun Huang, Fang Yang, Hao Tan, Hongfa Wang, Hongmei Wang, Jacob Song, Jianbing Wu, Jiangfeng Xiong, Jianwei Zhang, Jiawang Bai, Jie Jiang, Jinbao Xue, Jin Zhou, Joey Wang, Junkun Yuan, Kai Wang, Kathrina Wu, Mengyang Liu, Pengyu Li, Qinglin Lu, Qin Lin, Qi Tian, Rox Min, Shuai Li, Songtao Liu, Weijie Kong, Weiyan Wang, Wenqing Yu, Xinchi Deng, Xin Li, Yang Li, Yangyu Tao, Yanxin Long, Yi Chen, Yong Yang, Yuanbo Peng, Yuhong Liu, Yutao Cui, Zhentao Yu, Zhiyong Xu, Zhiyu He, Zijian Zhang, Zixiang Zhou, Zunnan Xu, Zuozhuo Dai.

**Figure 2.** Left: Computation resources used for closed-source and open-source video generation

**Figure 3.** The overall training system for HunyuanVideo.

**Figure 4.** Our hierarchical data filtering pipeline. We employ various filters for data filtering and progressively increase their thresholds to build 4 training datasets, i.e., 256p, 360p, 540p, and 720p, while the final SFT dataset is built through manual annotation. This figure highlights some of the most important filters to use at each stage. A large portion of data will be removed at each stage, ranging from ha…

**Figure 5.** The overall architecture of HunyuanVideo. The model is trained on a spatial-temporally

**Figure 6.** The architecture of our 3DVAE. 4.1.1 Training In contrast to most previous work [67, 11, 104], we do not rely on a pre-trained image VAE for parameter initialization; instead, we train our model from scratch. To balance the reconstruction quality of videos and images, we mix video and image data at a ratio of 4 : 1. Besides the routinely used L1 reconstruction loss and KL loss $L_{kl}$, we also incorporate perc…

**Figure 7.** VAE reconstruction case comparison

**Figure 8.** The architecture of our HunyuanVideo Diffusion Backbone.

**Figure 9.** Text encoder comparison between T5 XXL and the instruction-guided MLLM introduced

**Figure 10.** Scaling laws of DiT-T2X model family. On the top-left (a) we show the loss curves of the

**Figure 11.** (a) Different time-step schedulers. For our shifting strategy, we set a larger shifting factor

**Figure 12.** Prompt: A white cat sits on a white soft sofa like a person, while its long-haired male

**Figure 13.** High-quality videos generated by HunyuanVideo.

**Figure 14.** High-motion dynamics videos generated by HunyuanVideo.

**Figure 15.** HunyuanVideo’s performance on concept generalization. The results of the three rows correspond to the text prompts (1) ‘In a distant galaxy, an astronaut floats on a shimmering, pink, gemstone-like lake that reflects the vibrant colors of the surrounding sky, creating a stunning scene. The astronaut gently drifts on the lake’s surface, the soft sounds of water whispering the planet’s secrets. He reaches o…

**Figure 16.** Prompt: The woman walks over and opens the red wooden door. As the door swings open,

**Figure 17.** High text-video alignment videos generated by HunyuanVideo. Top row: Prompt: A

**Figure 18.** The architecture of sound effect and music generation model.

**Figure 19.** HunyuanVideo-I2V Diffusion Backbone. Image-to-video (I2V) task is a common application in video generation tasks. It usually means that given an image and a caption, the model uses this image as the first frame to generate a video that matches the caption. Although the naïve HunyuanVideo is a text-to-video (T2V) model, it can be easily extended to an I2V model. As shown in

**Figure 20.** Sample results of the I2V pre-training model.

**Figure 21.** Sample results of our portrait I2V model.

**Figure 22.** Overview of Avatar Animation built on top of HunyuanVideo. We adopt 3D VAE to encode and inject reference and pose condition, and use additional cross-attention layers to inject audio and expression signals. Masks are employed to explicitly guide where they are affecting. 7.3.1 Upper-Body Talking Avatar Generation In recent years, audio-driven digital human algorithms have made significant progress, espec…

**Figure 23.** Audio-Driven. HunyuanVideo can generate vivid talking avatar videos. space. We then inject the driving signals to the model by element-wise add as $\hat{z}_t + z_{\text{pose}}$. Note that $\hat{z}_t$ contains the appearance information of reference image. We use full-parameters finetune with pretrained T2V weights as initialization. Expression-Driven We can also control the facial expressions of digital character using implicit ex…

**Figure 24.** Pose-Driven. HunyuanVideo can animate a wide variety of characters with high quality and appearance consistency under various poses.

**Figure 25.** Expression-Driven. HunyuanVideo can accurately control facial movements of a wide variety of avatar styles.

**Figure 26.** Hybrid Condition-Driven. HunyuanVideo supports full control with multiple driving sources across various avatar characters. • High ID-Consistency. Our method maintains the ID-consistency well over the frames even with large poses, making it face-swapping free, thereby, could be used as real end-to-end animation solution. • Following Complex Poses Accurately. Our method is able to handle very complex poses…

Original abstract

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HunyuanVideo is a 13B open video model release with code, but the outperformance claim has no metrics or evaluation details to support it.

HunyuanVideo is a release of a 13B open video generative model with code, positioned as matching or exceeding closed-source leaders like Runway Gen-3 and Luma 1.6 per professional raters. The performance edge is the main selling point, but it comes with no supporting data in the paper itself.

The contribution is the scaled model and the systematic framework that includes data curation, architecture, progressive scaling, and training infrastructure. Making the code public at this parameter count is useful and addresses the closed-source gap directly. The paper does well in describing how they handled motion dynamics, text-video alignment, and advanced filming techniques through specific designs. This engineering focus is practical and shows attention to real use cases.

The soft spot is the evaluation. The abstract mentions extensive experiments and professional evaluations showing outperformance, yet supplies no metrics, ablations, baseline comparisons, or details on the raters, prompts, scoring rubric, or controls. Without those, the claim stays unverifiable.

This paper is for the video generation community that wants reproducible large models to experiment with or improve. It is less relevant for readers seeking new algorithmic insights or rigorous benchmarks. I think it should go to peer review. The model and code release are important enough to warrant discussion, even if the write-up needs more evidence to stand on its own.

Referee Report

2 major / 0 minor

Summary. The paper introduces HunyuanVideo, an open-source video generative foundation model exceeding 13 billion parameters. It describes a systematic framework encompassing data curation, architectural design, progressive scaling and training, and large-scale infrastructure. The central claim is that the model achieves visual quality, motion dynamics, text-video alignment, and filming techniques comparable to or surpassing closed-source SOTA systems (Runway Gen-3, Luma 1.6, and three leading Chinese models) according to professional evaluations, with code released publicly to narrow the open/closed-source gap.

Significance. If the outperformance claim holds under reproducible conditions, the work would be significant as the largest open-source video generation model released to date, accompanied by code at https://github.com/Tencent/HunyuanVideo. This release constitutes a concrete contribution that could enable community experimentation and reduce the performance disparity with closed-source systems. The emphasis on a full training and inference pipeline for billion-parameter video models is a strength worth documenting.

major comments (2)

[Abstract] The assertion that HunyuanVideo 'outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models' according to professional evaluations is the load-bearing claim of the manuscript, yet no details are supplied on rater count, selection criteria, scoring rubric, inter-rater reliability, prompt sampling method, inference compute parity across models, or statistical testing. Without these elements the comparison cannot be evaluated for bias or reproducibility.

[Abstract] The statement that 'extensive experiments and a series of targeted designs' were used to achieve high visual quality, motion dynamics, text-video alignment, and advanced filming techniques is unsupported by any quantitative metrics, ablation tables, baseline comparisons, or error analysis in the manuscript. This omission prevents assessment of whether the described framework components are responsible for the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation claims. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Referee: [Abstract] The assertion that HunyuanVideo 'outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models' according to professional evaluations is the load-bearing claim of the manuscript, yet no details are supplied on rater count, selection criteria, scoring rubric, inter-rater reliability, prompt sampling method, inference compute parity across models, or statistical testing. Without these elements the comparison cannot be evaluated for bias or reproducibility.

Authors: We agree that the professional evaluation details are essential for assessing reproducibility and potential biases. In the revised manuscript, we will add a new subsection under Experiments detailing the evaluation protocol. This will include the number of professional raters, their selection criteria and expertise, the scoring rubric, inter-rater reliability statistics (e.g., Cohen's kappa or similar), prompt sampling strategy, measures taken to align inference compute across models where feasible given closed-source constraints, and results of any statistical significance testing.

revision: yes

Referee: [Abstract] The statement that 'extensive experiments and a series of targeted designs' were used to achieve high visual quality, motion dynamics, text-video alignment, and advanced filming techniques is unsupported by any quantitative metrics, ablation tables, baseline comparisons, or error analysis in the manuscript. This omission prevents assessment of whether the described framework components are responsible for the claimed improvements.

Authors: We acknowledge that the current manuscript lacks quantitative metrics, ablation studies, and baseline comparisons to directly link specific design choices to the reported improvements. While the paper emphasizes the overall systematic framework, we will revise by adding an Experiments section with quantitative results, ablation tables for key components (e.g., data curation and architectural elements), baseline comparisons against prior models, and error analysis to better substantiate the contributions of the targeted designs.

revision: yes
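
The inter-rater reliability statistic named in the first response, Cohen's kappa, can be computed for two raters as in the sketch below. This is a generic textbook formulation and not the authors' evaluation code; the function name and the example labels are assumptions made for illustration. It covers exactly two raters, and a panel with more raters would typically need a multi-rater statistic such as Fleiss' kappa instead.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labeling the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four generated videos.
a = ["good", "good", "bad", "bad"]
b = ["good", "bad", "bad", "bad"]
print(cohens_kappa(a, b))  # 0.5
```

Kappa corrects raw agreement for the agreement two raters would reach by chance, which is why it is more informative than a plain percentage when one label dominates.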

Circularity Check

0 steps flagged

No circularity: empirical model release with no derivation chain

The manuscript reports training a 13B-parameter video model and claims outperformance via professional evaluations. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. The central claim rests on external human ratings rather than any internal reduction to inputs. Absence of evaluation protocol details is a transparency concern but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical training results and external human evaluations rather than mathematical axioms or new theoretical entities; no free parameters, axioms, or invented entities are introduced in the abstract.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
cs.CV 2026-05 unverdicted novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
cs.CV 2026-05 unverdicted novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Geo-Align: Video Generation Alignment via Metric Geometry Reward
cs.CV 2026-05 unverdicted novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
cs.CV 2026-05 unverdicted novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

DFSAttn is a training-free framework for dynamic fine-grained sparse attention in video DiTs that achieves up to 2.1x speedup while preserving generation quality via Hilbert reordering, hierarchical scoring, and adapt...
VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
cs.CV 2026-05 unverdicted novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration
cs.CV 2026-05 unverdicted novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.
ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration
cs.CV 2026-05 unverdicted novelty 7.0

ORBIS uses output-guided token reduction and DATM to achieve 2x higher token reduction than AsymRnR, with up to 4.5x speedup and 79.3% energy savings versus A100 GPU for video DiT models.
Q-ARVD: Quantizing Autoregressive Video Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

Q-ARVD introduces final-quality-aware frame weighting and outlier-aware adaptive dual-scale quantization to enable accurate low-bit inference for autoregressive video diffusion models.
What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
cs.CV 2026-05 unverdicted novelty 7.0

VLM-to-DiT alignment in video editing models acts as a semantic bottleneck that degrades fine-grained structural semantics, demonstrated via a new diagnostic dataset and protocol on relation-based edits.
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
cs.CV 2026-05 conditional novelty 7.0

MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation...
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
cs.CV 2026-05 unverdicted novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency sup...
PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.
InstructAV2AV: Instruction-Guided Audio-Video Joint Editing
cs.CV 2026-05 unverdicted novelty 7.0

InstructAV2AV is an end-to-end instruction-guided audio-video joint editing model that adapts a pre-trained backbone with gated attention and two-stage training, outperforming prior methods on 11 metrics after buildin...
StreamingEffect: Real-Time Human-Centric Video Effect Generation
cs.CV 2026-05 unverdicted novelty 7.0

StreamingEffect enables real-time 720p human-centric video effect generation on one GPU via teacher-student distillation, keyframe control, and a new 130K video dataset.
Accelerating Rectified Flow Models via Trajectory-Aware Caching
cs.CV 2026-05 unverdicted novelty 7.0

TACache accelerates rectified flow sampling up to 4.14x for text-to-image and 2.11x for text-to-video via offline skip scheduling from cumulative variation thresholds and online velocity reconstruction using historica...
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video ...
HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
cs.CV 2026-05 unverdicted novelty 7.0

HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
cs.CV 2026-05 unverdicted novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
Asymmetric Flow Models
cs.CV 2026-05 unverdicted novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
cs.CV 2026-05 unverdicted novelty 7.0

MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV 2026-05 unverdicted novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
cs.CV 2026-05 conditional novelty 7.0

HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
Detecting Deception, Not Deepfakes: Why Media Forensics Needs Social Theories
cs.CY 2026-05 unverdicted novelty 7.0

Deepfake detection must shift from classifying media realism to detecting communicative deception by applying Speech Act Theory, Grice's Cooperative Principle, and Cialdini's influence principles.
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
cs.CV 2026-05 unverdicted novelty 7.0

OphEdit enables text-guided editing of eye surgery videos without training by injecting preserved attention value tensors into the diffusion denoising process to maintain anatomical structure.
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
cs.CV 2026-05 unverdicted novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
cs.CV 2026-05 unverdicted novelty 7.0

WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
DirectEdit: Step-Level Accurate Inversion for Flow-Based Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

DirectEdit achieves step-level accurate inversion for flow-based image editing by directly aligning forward paths, using attention feature injection and mask-guided noise blending to balance fidelity and editability w...
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG 2026-05 unverdicted novelty 7.0

AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
cs.GR 2026-04 unverdicted novelty 7.0

Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing
cs.CV 2026-04 unverdicted novelty 7.0

FlowAnchor stabilizes editing signals in flow-based inversion-free video editing via spatial-aware attention refinement and adaptive magnitude modulation for improved faithfulness and temporal coherence.
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
cs.MM 2026-04 unverdicted novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
cs.CV 2026-04 unverdicted novelty 7.0

MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0

UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
cs.LG 2026-04 unverdicted novelty 7.0

SOAR is a reward-free on-policy method that supplies dense per-timestep supervision to correct exposure bias in diffusion model denoising trajectories, raising GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over ...
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
cs.CV 2026-04 unverdicted novelty 7.0

LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.
LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference
cs.CV 2026-04 unverdicted novelty 7.0

LayerCache enables per-layer-group caching in flow matching models via adaptive JVP span selection and greedy 3D scheduling, delivering 1.37x speedup with PSNR 37.46 dB, SSIM 0.9834, and LPIPS 0.0178 on Qwen-Image.
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
Immune2V: Image Immunization Against Dual-Stream Image-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Immune2V immunizes images against dual-stream I2V generation by enforcing temporally balanced latent divergence and aligning generative features to a precomputed collapse trajectory, yielding stronger persistent degra...
Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation
cs.AI 2026-04 unverdicted novelty 7.0

Camera Artist is a multi-agent framework introducing a Cinematography Shot Agent with recursive storyboard generation and cinematic language injection to improve narrative consistency and film quality in AI-generated ...

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 247 Pith papers · 16 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023. 9

work page 2023
[4]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023. 4

work page 2023
[5]

Stable video diffusion: Scaling latent video diffusion models to large datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint, 2023. 2, 27

work page 2023
[6]

Large Scale GAN Training for High Fidelity Natural Image Synthesis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Wing Yin Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. 4, 7, 27

work page 2024
[8]

Language Models are Few-Shot Learners

Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165,

work page internal anchor Pith review Pith/arXiv arXiv 2020
[9]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Sharegpt4video: Improving video understanding and generation with better captions

Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. Sharegpt4video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 4

work page arXiv 2024
[11]

Od-vae: An omni-dimensional video compressor for improving latent video diffusion model

Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinghua Cheng, and Li Yuan. Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. arXiv preprint arXiv:2409.01199, 2024. 6

work page arXiv 2024
[12]

Follow-your-canvas: Higher-resolution video outpainting with extensive content generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Follow-your-canvas: Higher-resolution video outpainting with extensive content generation. arXiv preprint arXiv:2409.01055, 2024. 27

work page arXiv 2024
[13]

Neural ordinary differential equations

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. 10

work page 2018
[14]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-Wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13320–13331. IEEE, ...

work page 2024
[15]

Seine: Short-to-long video diffusion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint, 2023. 27

work page 2023
[16]

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions

Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. arXiv preprint arXiv:2407.08136, 2024. 23

work page arXiv 2024
[17]

Xtuner: A toolkit for efficiently fine-tuning llm

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023. 8

work page 2023
[18]

OpenCV Developers. Opencv. https://opencv.org/. 3

work page
[19]

Pyscenedetect

PySceneDetect Developers. Pyscenedetect. https://www.scenedetect.com/. 3

work page
[20]

Lp-musiccaps: Llm-based pseudo music captioning

SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372, 2023. 19

work page arXiv 2023
[21]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. 2, 8, 9, 10

work page 2024
[22]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 6

work page 2021
[23]

Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers

Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024. 9

work page arXiv 2024
[24]

YOLOX: Exceeding YOLO Series in 2021

Z Ge. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709,

work page arXiv
[26]

Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

work page 2024
[27]

Sparsectrl: Adding sparse controls to text-to-video diffusion models

Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint, 2023. 27

work page 2023
[28]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. ICLR, 2024. 27

work page 2024
[29]

Taming data and transformers for audio generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, and Vicente Ordonez. Taming data and transformers for audio generation. arXiv preprint arXiv:2406.19388, 2024. 19

work page arXiv 2024
[30]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023. 14

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Animate-a-story: Storytelling with retrieval-augmented video generation

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint, 2023. 27

work page 2023
[32]

Video diffusion models

J Ho, T Salimans, A Gritsenko, W Chan, M Norouzi, and DJ Fleet. Video diffusion models. arXiv preprint, 2022. 27

work page 2022
[33]

Imagen video: High definition video generation with diffusion models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint, 2022. 27

work page 2022
[34]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 10, 27

work page 2020
[35]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022. 13

work page 2022
[36]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022. 8, 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint,

work page
[38]

A large tv dataset for speech and music activity detection

Yun-Ning Hung, Chih-Wei Wu, Iroro Orife, Aaron Hipple, William Wolcott, and Alexander Lerch. A large tv dataset for speech and music activity detection. EURASIP Journal on Audio, Speech, and Music Processing, 2022(1):21, 2022. 18

work page 2022
[39]

General data protection regulation (gdpr), n.d

Investopedia. General data protection regulation (gdpr), n.d. Accessed October 10, 2023. 3

work page 2023
[40]

Text2performer: Text-driven human video generation

Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2performer: Text-driven human video generation. arXiv preprint, 2023. 27

work page 2023
[41]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2020
[42]

Computational tradeoffs in image synthesis: Diffusion, masked-token, and next-token prediction

Maciej Kilian, Varun Jampani, and Luke Zettlemoyer. Computational tradeoffs in image synthesis: Diffusion, masked-token, and next-token prediction. arXiv preprint arXiv:2405.13218, 2024. 9

work page arXiv 2024
[43]

Re-ex: Revising after explanation reduces the factual errors in llm responses, 2024

Juyeon Kim, Jeongeun Lee, Yoonho Chang, Chanyeol Choi, Junseong Kim, and Jy-yong Sohn. Re-ex: Revising after explanation reduces the factual errors in llm responses, 2024. 12

work page 2024
[44]

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020. 19

work page 2020
[45]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 13

work page 2023
[46]

Open-sora-plan, April 2024

PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, April 2024. 27

work page 2024
[47]

Flux, 2024

Black Forest Labs. Flux, 2024. 2, 6, 8, 27

work page 2024
[48]

Tccl: Co-optimizing collective communication and traffic routing for gpu-centric clusters

Baojia Li, Xiaoliang Wang, Jingzhu Wang, Yifan Liu, Yuanyuan Gong, Hao Lu, Weizhen Dang, Weifeng Zhang, Xiaojie Huang, Mingzhuo Chen, et al. Tccl: Co-optimizing collective communication and traffic routing for gpu-centric clusters. In Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, pages 48–53, 2024. 13

work page 2024
[49]

On the scalability of diffusion-based text-to-image generation

Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion-based text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9400–9409, 2024. 9

work page 2024
[50]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022. 4

work page 2022
[51]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao X...

work page 2024
[52]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 2, 10

work page internal anchor Pith review Pith/arXiv arXiv 2022
[53]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 8

work page 2024
[54]

Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36, 2024. 19

work page 2024
[55]

Exploring the role of large language models in prompt encoding for diffusion models

Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, and Yu Liu. Exploring the role of large language models in prompt encoding for diffusion models. arXiv preprint arXiv:2406.11831, 2024. 8

work page arXiv 2024
[56]

Follow your pose: Pose-guided text-to-video generation using pose-free videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv preprint,

work page
[57]

Follow-your-click: Open-domain regional image animation via short prompts

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, et al. Follow-your-click: Open-domain regional image animation via short prompts. arXiv preprint arXiv:2403.08268, 2024. 27

work page arXiv 2024
[58]

Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. arXiv preprint arXiv:2406.01900, 2024. 23, 27

work page arXiv 2024
[59]

Some methods for classification and analysis of multivariate observations

J MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability/University of California Press, 1967. 3

work page 1967
[60]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306,

work page
[61]

Conditional image-to-video generation with latent flow diffusion models

Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In CVPR, 2023. 27

work page 2023
[62]

Angel-ptm: A scalable and economical large-scale pre-training system in tencent

Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, and Bin Cui. Angel-ptm: A scalable and economical large-scale pre-training system in tencent. arXiv preprint arXiv:2303.02868, 2023. 13

work page arXiv 2023
[63]

Context parallelism overview

NVIDIA. Context parallelism overview. 2024. 13

work page 2024
[64]

Cosmos-tokenizer, 2024

NVIDIA. Cosmos-tokenizer, 2024. 6

work page 2024
[65]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 2, 9

work page 2023
[66]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint, 2023. 8, 11

work page 2023
[67]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024. 2, 5, 6, 7, 13, 19, 27

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

A lip sync expert is all you need for speech to lip generation in the wild

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020. 22

work page 2020
[69]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 8

work page 2021
[70]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 19

work page 2021
[71]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. 8, 10, 19

work page 2020
[72]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 27

work page 2022
[73]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 10

work page internal anchor Pith review Pith/arXiv arXiv 2022
[74]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 13

work page internal anchor Pith review Pith/arXiv arXiv 2019
[75]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint, 2022. 27

work page 2022
[76]

Transnet v2: An effective deep network architecture for fast shot transition detection

Tomáš Souček and Jakub Lokoč. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020. 3

work page arXiv 2020
[77]

Roformer: Enhanced transformer with rotary position embedding, 2023

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. 8

work page 2023
[78]

Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent, 2024

Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpe...

work page 2024
[79]

Mochi 1: A new sota in open-source video generation

Genmo Team. Mochi 1: A new sota in open-source video generation. https://github.com/genmoai/models, 2024. 7, 27

work page 2024
[80]

Emo: Emote portrait alive – generating expressive portrait videos with audio2video diffusion model under weak conditions, 2024

Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive – generating expressive portrait videos with audio2video diffusion model under weak conditions, 2024. 22

work page 2024

Showing first 80 references.


work page arXiv 2024

[11] [11]

Od-vae: An omni-dimensional video compressor for improving latent video diffusion model

Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinghua Cheng, and Li Yuan. Od-vae: An omni-dimensional video compressor for improving latent video diffusion model. arXiv preprint arXiv:2409.01199, 2024. 6

work page arXiv 2024

[12] [12]

Follow-your-canvas: Higher-resolution video outpainting with extensive content generation

Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, and Wei Liu. Follow-your-canvas: Higher-resolution video outpainting with extensive content generation. arXiv preprint arXiv:2409.01055, 2024. 27

work page arXiv 2024

[13] [13]

Neural ordinary differential equations

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. 10

work page 2018

[14] [14]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-Wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , page 13320–13331. IEEE, ...

work page 2024

[15] [15]

Seine: Short-to-long video diffusion model for generative transition and prediction

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. arXiv preprint, 2023. 27

work page 2023

[16] [16]

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Condi- tions,

Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen, Yuming Li, and Chenguang Ma. Echomimic: Lifelike audio-driven portrait animations through editable landmark conditions. arXiv preprint arXiv:2407.08136, 2024. 23 29

work page arXiv 2024

[17] [17]

Xtuner: A toolkit for efficiently fine-tuning llm

XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github. com/InternLM/xtuner, 2023. 8

work page 2023

[18] [18]

OpenCV Developers. Opencv. https://opencv.org/. 3

work page

[19] [19]

Pyscenedetect

PySceneDetect Developers. Pyscenedetect. https://www.scenedetect.com/. 3

work page

[20] [20]

Lp-musiccaps: Llm-based pseudo music captioning

SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372, 2023. 19

work page arXiv 2023

[21] [21]

Scaling rectified flow transform- ers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transform- ers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024. 2, 8, 9, 10

work page 2024

[22] [22]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021. 6

work page 2021

[23] [23]

Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers

Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu, Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024. 9

work page arXiv 2024

[24] [24]

YOLOX: Exceeding YOLO Series in 2021

Z Ge. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021

[25] [25]

Emu video: Factorizing text-to-video generation by explicit image conditioning

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709,

work page arXiv

[26] [26]

Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

work page 2024

[27] [27]

Sparsectrl: Adding sparse controls to text-to-video diffusion models

Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint, 2023. 27

work page 2023

[28] [28]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. ICLR, 2024. 27

work page 2024

[29] [29]

Taming data and transformers for audio generation

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, and Vicente Ordonez. Taming data and transformers for audio generation. arXiv preprint arXiv:2406.19388, 2024. 19

work page arXiv 2024

[30] [30]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Pieter Abbeel Hao Liu, Matei Zaharia. Ring attention with blockwise transformers for near- infinite context. arXiv preprint arXiv:2310.01889, 2023. 14

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Animate-a-story: Storytelling with retrieval-augmented video generation

Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, et al. Animate-a-story: Storytelling with retrieval-augmented video generation. arXiv preprint, 2023. 27

work page 2023

[32] [32]

Video diffusion models

J Ho, T Salimans, A Gritsenko, W Chan, M Norouzi, and DJ Fleet. Video diffusion models. arxiv 2022. arXiv preprint, 2022. 27

work page 2022

[33] [33]

Imagen video: High definition video generation with diffusion models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint, 2022. 27 30

work page 2022

[34] [34]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 10, 27

work page 2020

[35] [35]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022. 13

work page 2022

[36] [36]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022. 8, 9, 10

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Animate anyone: Consistent and controllable image-to-video synthesis for character animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, and Liefeng Bo. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint,

work page

[38] [38]

A large tv dataset for speech and music activity detection

Yun-Ning Hung, Chih-Wei Wu, Iroro Orife, Aaron Hipple, William Wolcott, and Alexander Lerch. A large tv dataset for speech and music activity detection. EURASIP Journal on Audio, Speech, and Music Processing, 2022(1):21, 2022. 18

work page 2022

[39] [39]

General data protection regulation (gdpr), n.d

Investopedia. General data protection regulation (gdpr), n.d. Accessed October 10, 2023. 3

work page 2023

[40] [40]

Text2performer: Text-driven human video generation

Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2performer: Text-driven human video generation. arXiv preprint, 2023. 27

work page 2023

[41] [41]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2001

[42] [42]

Computational tradeoffs in image synthesis: Diffusion, masked-token, and next-token prediction

Maciej Kilian, Varun Japan, and Luke Zettlemoyer. Computational tradeoffs in image synthesis: Diffusion, masked-token, and next-token prediction. arXiv preprint arXiv:2405.13218, 2024. 9

work page arXiv 2024

[43] [43]

Re-ex: Revising after explanation reduces the factual errors in llm responses, 2024

Juyeon Kim, Jeongeun Lee, Yoonho Chang, Chanyeol Choi, Junseong Kim, and Jy yong Sohn. Re-ex: Revising after explanation reduces the factual errors in llm responses, 2024. 12

work page 2024

[44] [44]

Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020. 19

work page 2020

[45] [45]

Reducing activation recomputation in large transformer models

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023. 13

work page 2023

[46] [46]

Open-sora-plan, April 2024

PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, April 2024. 27

work page 2024

[47] [47]

Flux, 2024

Black Forest Labs. Flux, 2024. 2, 6, 8, 27

work page 2024

[48] [48]

Tccl: Co-optimizing collective communication and traffic routing for gpu-centric clusters

Baojia Li, Xiaoliang Wang, Jingzhu Wang, Yifan Liu, Yuanyuan Gong, Hao Lu, Weizhen Dang, Weifeng Zhang, Xiaojie Huang, Mingzhuo Chen, et al. Tccl: Co-optimizing collective communication and traffic routing for gpu-centric clusters. In Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, pages 48–53, 2024. 13

work page 2024

[49] [49]

On the scalability of diffusion- based text-to-image generation

Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion- based text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9400–9409, 2024. 9

work page 2024

[50] [50]

Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022. 4

work page 2022

[51] [51]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, 31 Jinbao X...

work page 2024

[52] [52]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 2, 10

work page internal anchor Pith review Pith/arXiv arXiv 2022

[53] [53]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 8

work page 2024

[54] [54]

Diff-foley: Synchronized video- to-audio synthesis with latent diffusion models

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video- to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36, 2024. 19

work page 2024

[55] [55]

Exploring the role of large language models in prompt encoding for diffusion models

Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, and Yu Liu. Exploring the role of large language models in prompt encoding for diffusion models. arXiv preprint arXiv:2406.11831, 2024. 8

work page arXiv 2024

[56] [56]

Follow your pose: Pose-guided text-to-video generation using pose-free videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Ying Shan, Xiu Li, and Qifeng Chen. Follow your pose: Pose-guided text-to-video generation using pose-free videos. arXiv preprint,

work page

[57] [57]

Follow-your-click: Open-domain regional image animation via short prompts

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, et al. Follow-your-click: Open-domain regional image animation via short prompts. arXiv preprint arXiv:2403.08268, 2024. 27

work page arXiv 2024

[58] [58]

Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation

Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, et al. Follow-your-emoji: Fine-controllable and expressive freestyle portrait animation. arXiv preprint arXiv:2406.01900, 2024. 23, 27

work page arXiv 2024

[59] [59]

Some methods for classification and analysis of multivariate observations

J MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability/University of California Press, 1967. 3

work page 1967

[60] [60]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14297–14306,

work page

[61] [61]

Conditional image-to-video generation with latent flow diffusion models

Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In CVPR, 2023. 27

work page 2023

[62] [62]

Angel-ptm: A scalable and economical large-scale pre-training system in tencent

Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, and Bin Cui. Angel-ptm: A scalable and economical large-scale pre-training system in tencent. arXiv preprint arXiv:2303.02868, 2023. 13

work page arXiv 2023

[63] [63]

Context parallelism overview

NVIDIA. Context parallelism overview. 2024. 13

work page 2024

[64] [64]

Cosmos-tokenizer, 2024

NVIDIA. Cosmos-tokenizer, 2024. 6

work page 2024

[65] [65]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 2, 9

work page 2023

[66] [66]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint, 2023. 8, 11

work page 2023

[67] [67]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024. 2, 5, 6, 7, 13, 19, 27

work page internal anchor Pith review Pith/arXiv arXiv 2024

[68] [68]

A lip sync expert is all you need for speech to lip generation in the wild

KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia, pages 484–492, 2020. 22 32

work page 2020

[69] [69]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 8

work page 2021

[70] [70]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 19

work page 2021

[71] [71]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020. 8, 10, 19

work page 2020

[72] [72]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 27

work page 2022

[73] [73]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022. 10

work page internal anchor Pith review Pith/arXiv arXiv 2022

[74] [74]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 13

work page internal anchor Pith review Pith/arXiv arXiv 1909

[75] [75]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint, 2022. 27

work page 2022

[76] [76]

arXiv preprint arXiv:2008.04838 (2020)

Tomáš Souˇcek and Jakub Lokoˇc. Transnet v2: An effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838, 2020. 3

work page arXiv 2008

[77] [77]

Roformer: Enhanced transformer with rotary position embedding, 2023

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. 8

work page 2023

[78] [78]

Hunyuan-large: An open-source moe model with 52 billion activated parameters by tencent, 2024

Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpe...

work page 2024

[79] [79]

Mochi 1: A new sota in open-source video generation

Genmo Team. Mochi 1: A new sota in open-source video generation. https://github. com/genmoai/models, 2024. 7, 27

work page 2024

[80] [80]

Emo: Emote portrait alive – generating expressive portrait videos with audio2video diffusion model under weak conditions, 2024

Linrui Tian, Qi Wang, Bang Zhang, and Liefeng Bo. Emo: Emote portrait alive – generating expressive portrait videos with audio2video diffusion model under weak conditions, 2024. 22

work page 2024