OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3
The pith
A unified model can generate high-quality videos of human-object interactions while following instructions from text, reference images, audio, and body poses simultaneously.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniShow is an end-to-end framework for Human-Object Interaction Video Generation that harmonizes multimodal conditions (text, reference images, audio, and pose) through three components: Unified Channel-wise Conditioning for image and pose injection, Gated Local-Context Attention for audio-visual synchronization, and a Decoupled-Then-Joint Training strategy that merges models trained on heterogeneous sub-task datasets. The work also introduces the HOIVG-Bench evaluation benchmark and reports state-of-the-art results.
What carries the argument
The combination of Unified Channel-wise Conditioning for efficient image and pose injection, Gated Local-Context Attention for precise audio-visual synchronization, and Decoupled-Then-Joint Training with model merging to integrate heterogeneous data sources.
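Neither the pith nor the abstract gives implementation details for Unified Channel-wise Conditioning, but the name suggests injecting reference-image and pose latents by concatenating them along the channel axis of the video latent. The following is a minimal sketch under that assumption; the shapes, the zero-initialized projection, and all module names are illustrative, not the paper's.

```python
# Minimal sketch of channel-wise condition injection for a latent video
# diffusion backbone. All shapes, the zero-initialized projection, and
# module names are assumptions made for illustration.
import torch
import torch.nn as nn

class ChannelWiseConditioning(nn.Module):
    def __init__(self, latent_ch: int = 16):
        super().__init__()
        # Fuse [video ; reference image ; pose] latents back to the
        # backbone's channel width; zero-init leaves the pretrained
        # text-to-video path unchanged at the start of fine-tuning.
        self.proj = nn.Conv3d(3 * latent_ch, latent_ch, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, noisy_video, ref_image, pose):
        # Inputs are latents of shape (batch, channels, frames, h, w);
        # the single-frame reference image is broadcast across time.
        ref_image = ref_image.expand(-1, -1, noisy_video.shape[2], -1, -1)
        fused = torch.cat([noisy_video, ref_image, pose], dim=1)
        return noisy_video + self.proj(fused)

# Toy usage: 2 clips, 16 latent channels, 8 latent frames, 32x32 latents.
x, p = torch.randn(2, 16, 8, 32, 32), torch.randn(2, 16, 8, 32, 32)
r = torch.randn(2, 16, 1, 32, 32)
print(ChannelWiseConditioning()(x, r, p).shape)  # torch.Size([2, 16, 8, 32, 32])
```

Zero-initializing the projection is a common way to add new condition channels to a pretrained backbone without disturbing its initial behavior; whether OmniShow does this is not stated in the excerpt.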
If this is right
- Generates videos that obey all provided conditions without sacrificing quality or synchronization.
- Outperforms prior methods across different combinations of input modalities.
- Effectively uses multiple existing datasets through a staged training and merging process.
- Provides a standard benchmark for measuring performance in human-object interaction video generation.
- Supports practical applications such as automated content creation for e-commerce and entertainment.
Where Pith is reading between the lines
- The methods could potentially apply to generating videos with other types of interactions or additional control signals.
- Future work might test how well the model performs when some input conditions conflict with each other.
- The benchmark dataset could encourage more standardized comparisons in multimodal video synthesis research.
- Model merging techniques might help address data limitations in other conditional generation tasks.
Load-bearing premise
The techniques for conditioning and training can successfully combine all the different input types without creating conflicts that reduce video quality or timing accuracy.
What would settle it
Experiments on HOIVG-Bench showing that OmniShow fails to outperform existing specialized models, either when all modalities are provided together or in specific single-modality settings, would refute the central claim.
Original abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
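The abstract attributes audio-visual synchronization to Gated Local-Context Attention but does not describe it further. One plausible reading is a cross-attention in which each frame's visual tokens attend only to a local temporal window of audio features, modulated by a learned gate. The sketch below is illustrative only; the window size, gating form, and module names are assumptions, not the paper's formulation.

```python
# Illustrative sketch of gated, locally-windowed audio cross-attention,
# assuming per-frame visual tokens and frame-aligned audio features.
# Window size, gating form, and module names are assumptions.
import torch
import torch.nn as nn

class GatedLocalAudioAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, window: int = 2):
        super().__init__()
        self.window = window  # audio frames attended on each side of frame t
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)  # per-token scalar gate in (0, 1)

    def forward(self, visual, audio):
        # visual: (batch, frames, tokens_per_frame, dim)
        # audio:  (batch, frames, dim), one audio feature per video frame
        out = torch.zeros_like(visual)
        for t in range(visual.shape[1]):
            lo, hi = max(0, t - self.window), t + self.window + 1
            q = visual[:, t]                 # (batch, tokens, dim)
            kv = audio[:, lo:hi]             # local audio context only
            ctx, _ = self.attn(q, kv, kv)
            g = torch.sigmoid(self.gate(q))  # how much audio each token admits
            out[:, t] = q + g * ctx          # gated residual injection
        return out

# Toy usage: 2 clips, 8 frames, 64 visual tokens per frame, dim 256.
v, a = torch.randn(2, 8, 64, 256), torch.randn(2, 8, 256)
print(GatedLocalAudioAttention()(v, a).shape)  # torch.Size([2, 8, 64, 256])
```

Restricting attention to a local audio window and gating its contribution is one way such a design could keep lip and contact motion tied to nearby audio frames without letting distant audio leak into unrelated frames; the paper's actual mechanism may differ.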
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniShow, an end-to-end framework for Human-Object Interaction Video Generation (HOIVG) conditioned on text, reference images, audio, and pose. It proposes Unified Channel-wise Conditioning for efficient image/pose injection, Gated Local-Context Attention for audio-visual synchronization, and a Decoupled-Then-Joint Training strategy with model merging to address data scarcity from heterogeneous sub-task datasets. A new benchmark HOIVG-Bench is introduced, and the work claims overall state-of-the-art performance across multimodal conditioning settings while overcoming controllability-quality trade-offs.
Significance. If substantiated, this would advance the emerging HOIVG task by unifying four modalities in a single model and providing the first dedicated benchmark, with potential value for applications like e-commerce and short-form video. The Decoupled-Then-Joint Training approach, if shown to avoid interference, offers a practical way to leverage existing sub-task data; the new benchmark fills a clear evaluation gap and could become a standard reference.
Major comments (2)
- [Abstract] The central claim that 'extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance' is unsupported by any quantitative metrics, ablation tables, error analysis, or implementation details; this is load-bearing for the assertion that the proposed components overcome the controllability-quality trade-off.
- [Decoupled-Then-Joint Training] The model-merging step is described without any quantification of interference, forgetting, or synchronization degradation across the text-to-video, image-conditioned, audio-sync, and pose sub-task models; this bears directly on the weakest assumption, namely that Unified Channel-wise Conditioning and Gated Local-Context Attention remain effective post-merging under simultaneous multimodal conditions.
Minor comments (1)
- [Abstract] The phrase 'industry-grade performance' is imprecise and should be replaced with concrete metrics or explicit comparisons to prior work.
Simulated Author's Rebuttal
Thank you for the referee's thorough review and valuable comments on our paper. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.
Point-by-point responses
-
Referee: Abstract: the central claim that 'extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance' is unsupported by any quantitative metrics, ablation tables, error analysis, or implementation details, which is load-bearing for validating the assertion that the proposed components overcome controllability-quality trade-offs.
Authors: We thank the referee for highlighting this issue with the abstract. While the abstract serves as a concise overview, the full manuscript includes comprehensive quantitative evaluations in the Experiments section, featuring comparison tables against state-of-the-art methods, ablation studies on Unified Channel-wise Conditioning and Gated Local-Context Attention, and analysis of the controllability-quality trade-off. To make the abstract's claim more robust and self-contained, we will revise it to incorporate brief mentions of key metrics or direct references to the supporting experimental results. revision: yes
-
Referee: Decoupled-Then-Joint Training strategy: the model-merging step is described without any quantification of interference, forgetting, or synchronization degradation across the text-to-video, image-conditioned, audio-sync, and pose sub-task models; this directly affects the weakest assumption that the Unified Channel-wise Conditioning and Gated Local-Context Attention remain effective post-merging under simultaneous multimodal conditions.
Authors: We agree with the referee that quantifying the effects of the model-merging step is important for validating the Decoupled-Then-Joint Training strategy. The current description focuses on the overall performance gains, but lacks specific measurements of interference or degradation. In the revised manuscript, we will include additional experiments that compare model performance before and after merging, using relevant metrics for quality, controllability, and synchronization across the different conditioning modalities. This will provide direct evidence that the proposed conditioning mechanisms remain effective. revision: yes
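The merging step at issue in this exchange is not specified in the excerpt. If it amounts to a weighted average of sub-task checkpoints in parameter space (one common reading of "model merging"), the before/after comparison the authors commit to could be run around a routine like the one below; the function and toy usage are illustrative, not OmniShow's recipe.

```python
# Rough illustration of the before/after-merging comparison discussed in
# the exchange above, assuming "model merging" means a weighted average of
# sub-task checkpoints in parameter space. The real recipe (weights, which
# layers are merged, subsequent joint fine-tuning) is not given here.
import torch
import torch.nn as nn

def merge_state_dicts(checkpoints, weights):
    """Weighted average of matching parameters across sub-task checkpoints."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {name: sum(w * ckpt[name].float() for w, ckpt in zip(weights, checkpoints))
            for name in checkpoints[0]}

# Toy demonstration with a tiny shared backbone standing in for the
# text-to-video, image/pose, and audio-sync experts.
experts = [nn.Linear(4, 4) for _ in range(3)]
merged = merge_state_dicts([m.state_dict() for m in experts], [0.4, 0.3, 0.3])
joint_model = nn.Linear(4, 4)
joint_model.load_state_dict(merged)
# Interference or forgetting would then be quantified by re-running each
# sub-task metric on the merged (and jointly fine-tuned) model and
# comparing against the individual experts.
```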
Circularity Check
No circularity: claims rest on new components and empirical validation on a new benchmark
Full rationale
The paper proposes three new technical elements (Unified Channel-wise Conditioning, Gated Local-Context Attention, and Decoupled-Then-Joint Training with model merging) plus a new evaluation benchmark (HOIVG-Bench). Its central claim is that these elements together produce SOTA results on multimodal HOIVG tasks. No equation or derivation reduces a predicted quantity to a fitted parameter by construction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via self-citation. The performance statements are grounded in external experiments on the newly introduced benchmark rather than in any self-referential re-labeling of inputs. This is the normal non-circular case for an applied CV architecture paper.
Forward citations
Cited by 1 Pith paper
-
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.