OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3
The pith
A unified model can generate high-quality videos of human-object interactions while following instructions from text, reference images, audio, and body poses simultaneously.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniShow is an end-to-end framework for Human-Object Interaction Video Generation that harmonizes multimodal conditions (text, reference images, audio, and pose) through three components: Unified Channel-wise Conditioning for image and pose injection, Gated Local-Context Attention for audio-visual synchronization, and a Decoupled-Then-Joint Training strategy that merges models trained on heterogeneous sub-task datasets. The work also introduces the HOIVG-Bench evaluation benchmark and reports state-of-the-art results.
What carries the argument
The combination of Unified Channel-wise Conditioning for efficient image and pose injection, Gated Local-Context Attention for precise audio-visual synchronization, and Decoupled-Then-Joint Training with model merging to integrate heterogeneous data sources.
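Neither the pith nor the abstract gives implementation details for Unified Channel-wise Conditioning, but the name suggests injecting reference-image and pose latents by concatenating them along the channel axis of the video latent. The following is a minimal sketch under that assumption; the shapes, the zero-initialized projection, and all module names are illustrative, not the paper's.

```python
# Minimal sketch of channel-wise condition injection for a latent video
# diffusion backbone. All shapes, the zero-initialized projection, and
# module names are assumptions made for illustration.
import torch
import torch.nn as nn

class ChannelWiseConditioning(nn.Module):
    def __init__(self, latent_ch: int = 16):
        super().__init__()
        # Fuse [video ; reference image ; pose] latents back to the
        # backbone's channel width; zero-init leaves the pretrained
        # text-to-video path unchanged at the start of fine-tuning.
        self.proj = nn.Conv3d(3 * latent_ch, latent_ch, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, noisy_video, ref_image, pose):
        # Inputs are latents of shape (batch, channels, frames, h, w);
        # the single-frame reference image is broadcast across time.
        ref_image = ref_image.expand(-1, -1, noisy_video.shape[2], -1, -1)
        fused = torch.cat([noisy_video, ref_image, pose], dim=1)
        return noisy_video + self.proj(fused)

# Toy usage: 2 clips, 16 latent channels, 8 latent frames, 32x32 latents.
x, p = torch.randn(2, 16, 8, 32, 32), torch.randn(2, 16, 8, 32, 32)
r = torch.randn(2, 16, 1, 32, 32)
print(ChannelWiseConditioning()(x, r, p).shape)  # torch.Size([2, 16, 8, 32, 32])
```

Zero-initializing the projection is a common way to add new condition channels to a pretrained backbone without disturbing its initial behavior; whether OmniShow does this is not stated in the excerpt.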
If this is right
- Generates videos that obey all provided conditions without sacrificing quality or synchronization.
- Outperforms prior methods across different combinations of input modalities.
- Effectively uses multiple existing datasets through a staged training and merging process.
- Provides a standard benchmark for measuring performance in human-object interaction video generation.
- Supports practical applications such as automated content creation for e-commerce and entertainment.
Where Pith is reading between the lines
- The methods could potentially apply to generating videos with other types of interactions or additional control signals.
- Future work might test how well the model performs when some input conditions conflict with each other.
- The benchmark dataset could encourage more standardized comparisons in multimodal video synthesis research.
- Model merging techniques might help address data limitations in other conditional generation tasks.
Load-bearing premise
The techniques for conditioning and training can successfully combine all the different input types without creating conflicts that reduce video quality or timing accuracy.
What would settle it
Experiments on HOIVG-Bench showing that OmniShow fails to outperform existing specialized models, either when all modalities are provided together or in specific single-modality settings, would refute the central claim.
Original abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
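The abstract attributes audio-visual synchronization to Gated Local-Context Attention but does not describe it further. One plausible reading is a cross-attention in which each frame's visual tokens attend only to a local temporal window of audio features, modulated by a learned gate. The sketch below is illustrative only; the window size, gating form, and module names are assumptions, not the paper's formulation.

```python
# Illustrative sketch of gated, locally-windowed audio cross-attention,
# assuming per-frame visual tokens and frame-aligned audio features.
# Window size, gating form, and module names are assumptions.
import torch
import torch.nn as nn

class GatedLocalAudioAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, window: int = 2):
        super().__init__()
        self.window = window  # audio frames attended on each side of frame t
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # per-token scalar gate in (0, 1)

    def forward(self, visual, audio):
        # visual: (batch, frames, tokens_per_frame, dim)
        # audio:  (batch, frames, dim), one audio feature per video frame
        out = torch.zeros_like(visual)
        for t in range(visual.shape[1]):
            lo, hi = max(0, t - self.window), t + self.window + 1
            q = visual[:, t]                 # (batch, tokens, dim)
            kv = audio[:, lo:hi]             # local audio context only
            ctx, _ = self.attn(q, kv, kv)
            g = torch.sigmoid(self.gate(q))  # how much audio each token admits
            out[:, t] = q + g * ctx          # gated residual injection
        return out

# Toy usage: 2 clips, 8 frames, 64 visual tokens per frame, dim 256.
v, a = torch.randn(2, 8, 64, 256), torch.randn(2, 8, 256)
print(GatedLocalAudioAttention()(v, a).shape)  # torch.Size([2, 8, 64, 256])
```

Restricting attention to a local audio window and gating its contribution is one way such a design could keep lip and contact motion tied to nearby audio frames without letting distant audio leak into unrelated frames; the paper's actual mechanism may differ.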
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniShow, an end-to-end framework for Human-Object Interaction Video Generation (HOIVG) conditioned on text, reference images, audio, and pose. It proposes Unified Channel-wise Conditioning for efficient image/pose injection, Gated Local-Context Attention for audio-visual synchronization, and a Decoupled-Then-Joint Training strategy with model merging to address data scarcity from heterogeneous sub-task datasets. A new benchmark HOIVG-Bench is introduced, and the work claims overall state-of-the-art performance across multimodal conditioning settings while overcoming controllability-quality trade-offs.
Significance. If substantiated, this would advance the emerging HOIVG task by unifying four modalities in a single model and providing the first dedicated benchmark, with potential value for applications like e-commerce and short-form video. The Decoupled-Then-Joint Training approach, if shown to avoid interference, offers a practical way to leverage existing sub-task data; the new benchmark fills a clear evaluation gap and could become a standard reference.
Major comments (2)
- [Abstract] The central claim that 'extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance' is unsupported by any quantitative metrics, ablation tables, error analysis, or implementation details; this is load-bearing for the assertion that the proposed components overcome the controllability-quality trade-off.
- [Decoupled-Then-Joint Training] The model-merging step is described without any quantification of interference, forgetting, or synchronization degradation across the text-to-video, image-conditioned, audio-sync, and pose sub-task models; this bears directly on the weakest assumption, namely that Unified Channel-wise Conditioning and Gated Local-Context Attention remain effective post-merging under simultaneous multimodal conditions.
Minor comments (1)
- [Abstract] The phrase 'industry-grade performance' is imprecise and should be replaced with concrete metrics or explicit comparisons to prior work.
Simulated Author's Rebuttal
Thank you for the referee's thorough review and valuable comments on our paper. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.
Point-by-point responses
-
Referee: Abstract: the central claim that 'extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance' is unsupported by any quantitative metrics, ablation tables, error analysis, or implementation details, which is load-bearing for validating the assertion that the proposed components overcome controllability-quality trade-offs.
Authors: We thank the referee for highlighting this issue with the abstract. While the abstract serves as a concise overview, the full manuscript includes comprehensive quantitative evaluations in the Experiments section, featuring comparison tables against state-of-the-art methods, ablation studies on Unified Channel-wise Conditioning and Gated Local-Context Attention, and analysis of the controllability-quality trade-off. To make the abstract's claim more robust and self-contained, we will revise it to incorporate brief mentions of key metrics or direct references to the supporting experimental results. revision: yes
-
Referee: Decoupled-Then-Joint Training strategy: the model-merging step is described without any quantification of interference, forgetting, or synchronization degradation across the text-to-video, image-conditioned, audio-sync, and pose sub-task models; this directly affects the weakest assumption that the Unified Channel-wise Conditioning and Gated Local-Context Attention remain effective post-merging under simultaneous multimodal conditions.
Authors: We agree with the referee that quantifying the effects of the model-merging step is important for validating the Decoupled-Then-Joint Training strategy. The current description focuses on the overall performance gains, but lacks specific measurements of interference or degradation. In the revised manuscript, we will include additional experiments that compare model performance before and after merging, using relevant metrics for quality, controllability, and synchronization across the different conditioning modalities. This will provide direct evidence that the proposed conditioning mechanisms remain effective. revision: yes
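The merging step at issue in this exchange is not specified in the excerpt. If it amounts to a weighted average of sub-task checkpoints in parameter space (one common reading of "model merging"), the before/after comparison the authors commit to could be run around a routine like the one below; the function and toy usage are illustrative, not OmniShow's recipe.

```python
# Rough illustration of the before/after-merging comparison discussed in
# the exchange above, assuming "model merging" means a weighted average of
# sub-task checkpoints in parameter space. The real recipe (weights, which
# layers are merged, subsequent joint fine-tuning) is not given here.
import torch
import torch.nn as nn

def merge_state_dicts(checkpoints, weights):
    """Weighted average of matching parameters across sub-task checkpoints."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {name: sum(w * ckpt[name].float() for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}

# Toy demonstration with a tiny shared backbone standing in for the
# text-to-video, image/pose, and audio-sync experts.
experts = [nn.Linear(4, 4) for _ in range(3)]
merged = merge_state_dicts([m.state_dict() for m in experts], [0.4, 0.3, 0.3])
joint_model = nn.Linear(4, 4)
joint_model.load_state_dict(merged)
# Interference or forgetting would then be quantified by re-running each
# sub-task metric on the merged (and jointly fine-tuned) model and
# comparing against the individual experts.
```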
Circularity Check
No circularity: claims rest on new components and empirical validation on a new benchmark
Full rationale
The paper proposes three new technical elements (Unified Channel-wise Conditioning, Gated Local-Context Attention, and Decoupled-Then-Joint Training with model merging) plus a new evaluation benchmark (HOIVG-Bench). Its central claim is that these elements together produce SOTA results on multimodal HOIVG tasks. No equation or derivation reduces a predicted quantity to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via self-citation. The performance statements are grounded in external experiments on the newly introduced benchmark rather than in any self-referential re-labeling of inputs. This is the normal non-circular case for an applied CV architecture paper.
Forward citations
Cited by 1 Pith paper
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.