pith. the verified trust layer for science. sign in

arxiv: 2506.19840 · v2 · submitted 2025-06-24 · 💻 cs.CV

GenHSI: Controllable Generation of Human-Scene Interaction Videos

Pith reviewed 2026-05-19 07:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords human-scene interactionvideo generationtraining-freediffusion models3D keyframesview canonicalizationcontrollable synthesislong video generation
0
0 comments X p. Extension

The pith

A training-free pipeline generates long videos of humans interacting with scenes from a single image reference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that long sequences of human actions interacting with a fixed scene can be produced without training any new models, simply by chaining together existing image and video generators in a structured way. It starts from one scene photo and one character photo plus a text description of a series of actions, then breaks the description into simple steps, creates key 3D poses, and finally animates them. A reader would care because this removes the usual barriers of collecting massive training data or running long fine-tuning jobs while still aiming to keep the person's appearance and the scene's layout intact across many seconds of video. If the approach works, it would lower the cost of creating realistic interaction footage for simulation, games, or pre-visualization work.

Core claim

The authors claim that GenHSI is the first method to synthesize a long video sequence containing a chain of human-scene interaction actions without any training, using only image references of the scene and the character. The pipeline divides synthesis into script writing that turns complex prompts into atomic actions, pre-visualization that produces 3D keyframes by first generating 2D interactions through inpainting after view canonicalization and then lifting them to 3D via iterative optimization guided by contact cues and vision-language model reasoning, and finally animation that feeds those keyframes into a pre-trained video diffusion model to produce consistent motion with plausible 3D

What carries the argument

View canonicalization that lets pre-trained 2D inpainting models create interaction poses without multi-view fitting, followed by iterative optimization that uses contact cues and VLM reasoning to turn those poses into usable 3D keyframes.

If this is right

  • Long chains of multiple human-scene interactions become feasible while preserving character identity and scene content from the input images.
  • Plausible dynamics and scene affordances appear because the video diffusion model is conditioned on explicit 3D keyframes rather than text alone.
  • Complex descriptions of chained actions are managed by first reducing them to a sequence of atomic steps during script writing.
  • The entire process runs from single reference images without needing upfront 3D scene reconstructions or multiple camera views.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same script-plus-keyframe-plus-animation structure could be tried for videos that involve several people or changing scene elements by extending the atomic action list and the keyframe stage.
  • If the 2D-to-3D lifting step proves reliable, it may reduce the need for full multi-view capture when creating 3D-consistent animations in other settings.
  • Applying the method to scenes whose affordances are less obvious, such as deformable objects or slippery floors, would test how far the contact and reasoning cues can stretch.

Load-bearing premise

The iterative optimization step guided by contact cues and vision-language model reasoning will reliably produce 3D poses from 2D inpaints that the video diffusion model can then animate into coherent and plausible long sequences.

What would settle it

Run the method on a sequence such as walking across a room, pulling out a chair, and sitting down, then inspect the output frames to see whether the person stays the same, feet and hands make correct contact with the floor and furniture, and no body parts pass through objects or float in air.

Figures

Figures reproduced from arXiv: 2506.19840 by Daniel Ritchie, Rahul Sajnani, Rui Zhou, Srinath Sridhar, Xiaoyan Cong, Zekun Li.

Figure 1
Figure 1. Figure 1: GenHSI is a user-friendly automatic generation system for controllable human-scene interaction video generation. We mimic the real-world filmmaking procedure, i.e., scriptwriting, previsualization, and animation, to generate an extendable cinematic video clip with arbitrary chain-of-actions in a consistent camera setting based on 3D-aware keyframes. Based on the user-specified image-based scene and 3D char… view at source ↗
Figure 2
Figure 2. Figure 2: Script Writing Stage: Complex high-level text descriptions from users do not provide a detailed scene and task understanding for the desired long video generation. The script writing stage first identifies and segments objects that the human can interact with in the scene. These objects, along with the given human prompt, are used to perform text-based motion planning from a VLM [65] that provides us with … view at source ↗
Figure 3
Figure 3. Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Failure cases when inpainting a human from different [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Human-Object Interaction from Inpainting Models. We progressively update the human mask (bottom) while denois￾ing the inpainting result (top). ambiguity in the absence of depth. Hence, we formulate an optimization framework that corrects the scale sh and translation th of the human to resolve inaccurate scale & translation in 3D. This optimization (1) minimizes the pen￾etration between the human and the co… view at source ↗
Figure 6
Figure 6. Figure 6: Long Video Generation Qualitative Results: GenHSI (ours) produces best results with subject identity preservation and good human-object contacts. GenHSI (DKF) only uses two keyframes (start and end frame) and often changes the identity of the subject. Kling Elements drastically change the scene and character. Each video displays the keyframes highlighted with a red bounding box. more details in the video r… view at source ↗
Figure 7
Figure 7. Figure 7: Harmonization (green) significantly improves our appearance. C. Human Body Surface Partition Definition To encourage ChatGPT outputs faithful and executable re￾sults, we divide the surface of the human body into 15 parts based on SMPL-X [69] template, i.e., “head”, “left upper arm”, “right upper arm”, “left forearm”, “right forearm”, “left hand”, “right hand”, “back”, “buttocks”, “left thigh”, “right thigh… view at source ↗
Figure 9
Figure 9. Figure 9: 3D Human-Object Interactions GenHSI performs improved human object interactions even when we don’t have access to accurate scene geometry. Our work also produces more plausible poses for lying, sitting, and standing. Prior works like GenZI have inconsistent multiview inpainting resulting in diverse but uncomfortable human poses as seen in lying down and sitting on table. 15 [PITH_FULL_IMAGE:figures/full_f… view at source ↗
read the original abstract

Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in generating long videos with rich human-scene interactions (HSI), including unrealistic dynamics and affordance, lack of subject identity preservation, and the need for expensive training. To this end, we propose GenHSI, a training-free method for controllable generation of long HSI videos with 3D awareness. Taking inspiration from movie animation, we subdivide the video synthesis into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene and a character with a user description, we use these three stages to generate long videos that preserve human identity and provide rich and plausible HSI. Script writing converts a complex text prompt involving a chain of HSI into simple atomic actions that are used in the pre-visualization stage to generate 3D keyframes. To synthesize plausible human interaction poses in 3D keyframes, we utilize pre-trained 2D inpainting diffusion models to generate plausible 2D human interactions based on view canonicalization, which eliminates the need for multi-view fitting in previous works. We then extend these interactions to 3D using robust iterative optimization, informed by contact cues and reasoning from VLMs. Prompted by these 3D keyframes, the pretrained video diffusion models can better generate consistent long videos with plausible dynamics and affordance in a 3D-aware manner. We are the first to synthesize a long video sequence with a chain of HSI actions without training based on the image references of the scene and character. Experiments demonstrate that our method can generate HSI videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents GenHSI, a training-free pipeline for controllable generation of long human-scene interaction (HSI) videos from a single scene image, character reference, and text description. It decomposes synthesis into three stages: script writing (decomposing complex HSI prompts into atomic actions), pre-visualization (generating 3D keyframes via 2D inpainting diffusion models after view canonicalization, followed by iterative 3D optimization using contact cues and VLM reasoning), and animation (conditioning pre-trained video diffusion models on the resulting keyframes). The central claim is that this produces long video sequences with chains of plausible HSI actions while preserving scene content and character identity, and that the method is the first to achieve this without training.

Significance. If the central claims hold, the work would be significant for enabling training-free, controllable synthesis of long-duration HSI videos by composing existing pre-trained models in a movie-animation-inspired pipeline. It targets open challenges in realistic dynamics, affordance, and identity preservation for applications in animation and simulation. The avoidance of multi-view fitting via view canonicalization and the integration of VLM reasoning for optimization represent potentially useful engineering contributions, though their impact depends on empirical validation.

major comments (2)
  1. [Experiments] Experiments section: the abstract and method description state that experiments demonstrate preservation of scene content and identity with plausible interactions, yet no quantitative metrics (e.g., FID, CLIP similarity, contact accuracy), ablation studies, or failure-case analysis are reported. This absence is load-bearing for the central claim of reliable long-sequence HSI generation.
  2. [Pre-visualization stage] Pre-visualization stage (method description): the iterative 3D optimization step that extends 2D inpainted poses (post-view-canonicalization) to keyframes using contact cues and VLM reasoning is asserted to be robust, but no convergence criteria, quantitative 3D-pose metrics (MPJPE, penetration rate), or analysis of local-minima escape on chained actions are supplied. This step is load-bearing for the claim that the frozen video diffusion model will produce consistent dynamics and affordance.
minor comments (2)
  1. [Related Work] The claim of being 'the first' to synthesize long chained HSI videos without training would benefit from a more explicit comparison table against prior training-free HSI or video-generation methods in the related-work section.
  2. [Method] Notation for 'view canonicalization' is introduced without a formal equation or diagram; a short definition or reference to the exact transformation would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We appreciate the positive assessment of the potential impact of GenHSI and the constructive criticism regarding empirical validation. We address each major comment in detail below, committing to revisions that enhance the manuscript's rigor.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the abstract and method description state that experiments demonstrate preservation of scene content and identity with plausible interactions, yet no quantitative metrics (e.g., FID, CLIP similarity, contact accuracy), ablation studies, or failure-case analysis are reported. This absence is load-bearing for the central claim of reliable long-sequence HSI generation.

    Authors: We agree that quantitative evaluation is important for substantiating the claims of scene and identity preservation as well as plausible interactions. The current version of the manuscript relies on qualitative demonstrations through visual results. In the revision, we will introduce quantitative metrics including CLIP-based similarity scores for character identity preservation, contact accuracy measures for human-scene interactions, and FID scores for video quality where feasible. Additionally, we will conduct ablation studies on the script writing, pre-visualization, and animation stages, and include an analysis of failure cases. These additions will be presented in an expanded Experiments section. revision: yes

  2. Referee: [Pre-visualization stage] Pre-visualization stage (method description): the iterative 3D optimization step that extends 2D inpainted poses (post-view-canonicalization) to keyframes using contact cues and VLM reasoning is asserted to be robust, but no convergence criteria, quantitative 3D-pose metrics (MPJPE, penetration rate), or analysis of local-minima escape on chained actions are supplied. This step is load-bearing for the claim that the frozen video diffusion model will produce consistent dynamics and affordance.

    Authors: We acknowledge the need for more details and validation on the pre-visualization stage, particularly the iterative 3D optimization. In the revised manuscript, we will specify the convergence criteria used in the optimization process. We will also report quantitative 3D pose metrics such as Mean Per Joint Position Error (MPJPE) and penetration rates, evaluated against ground-truth poses where available or through proxy measures. Furthermore, we will provide analysis on how the method handles local minima in the context of chained actions, including examples and VLM reasoning steps. This will better support the claim regarding consistent dynamics in the subsequent animation stage. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method composes external pre-trained models

full rationale

The paper describes a three-stage pipeline that chains existing pre-trained 2D inpainting diffusion models, VLM reasoning, iterative 3D optimization with contact cues, and a frozen video diffusion model. No equations, fitted parameters, or self-referential definitions appear in the abstract or claimed contributions. The central claim of training-free long HSI video synthesis rests on the composition of these external components rather than any derivation that reduces to its own inputs by construction. The optimization step is presented as a practical engineering choice informed by external cues, not as a mathematical result derived from the paper's own fitted quantities or prior self-citations that would create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method depends on the capabilities of several external pre-trained models and on the effectiveness of an iterative 3D optimization step whose success is not independently verified in the provided abstract.

axioms (2)
  • domain assumption Pre-trained 2D inpainting diffusion models can generate plausible 2D human interactions when conditioned on view-canonicalized inputs
    Invoked in the pre-visualization stage to create 2D poses before 3D lifting.
  • domain assumption VLM reasoning plus contact cues can guide iterative optimization to produce usable 3D human-scene interaction poses
    Central to extending 2D results to 3D keyframes without multi-view fitting.

pith-pipeline@v0.9.0 · 5873 in / 1554 out tokens · 59010 ms · 2026-05-19T07:22:24.208247+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    CoMoVi co-generates 3D human motions and 2D videos synchronously in a single diffusion denoising loop using 3D-to-2D projection and dual-branch diffusion with 3D-2D cross attentions.

  2. VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

    cs.CV 2025-12 unverdicted novelty 6.0

    VHOI densifies sparse trajectories into color-encoded HOI mask sequences and conditions a fine-tuned video diffusion model on them to produce controllable human-object interaction videos, including full navigation sequences.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  1. [1]

    https://huggingface

    Realistic vision inpainting. https://huggingface. co / Uminosachi / realisticVisionV51 _ v51VAE-inpainting. 5

  2. [2]

    https : / / klingai

    Kling ai 1.6 elements. https : / / klingai . com / image-to-video/multi-id/ , . 2, 7

  3. [3]

    https : / / klingai

    Kling ai 1.6 frames. https : / / klingai . com / image-to-video/frame-mode/ , . 2

  4. [4]

    Circle: Capture in rich contextual environments

    Joao Pedro Ara ´ujo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition , pages 21211–21221, 2023. 3

  5. [5]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 2

  6. [6]

    Align your latents: High-resolution video syn- thesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video syn- thesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023. 2

  7. [7]

    Go-with-the-flow: Motion- controllable video diffusion models using real-time warped noise

    Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingx- iao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion- controllable video diffusion models using real-time warped noise. arXiv preprint arXiv:2501.08331, 2025. 3

  8. [8]

    Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation

    Minghong Cai, Xiaodong Cun, Xiaoyu Li, Wenze Liu, Zhaoyang Zhang, Yong Zhang, Ying Shan, and Xiangyu Yue. Ditctrl: Exploring attention control in multi-modal diffusion transformer for tuning-free multi-prompt longer video generation. ArXiv, abs/2412.18597, 2024. 2

  9. [9]

    Wang, and Gordon Wet- zstein

    Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun- Hao Paul Huang, Tuanfeng Y . Wang, and Gordon Wet- zstein. Generative rendering: Controllable 4d-guided video generation with 2d diffusion models. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7611–7620, 2023. 2

  10. [10]

    Gen- erating human motion in 3d scenes from text descriptions

    Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, and Xiaowei Zhou. Gen- erating human motion in 3d scenes from text descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1855–1866, 2024. 3

  11. [11]

    Videocrafter2: Overcoming data limitations for high- quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao-Liang Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high- quality video diffusion models. 2024 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 7310–7320, 2024. 2

  12. [12]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- α: Fast train- ing of diffusion transformer for photorealistic text-to-image synthesis. ArXiv, abs/2310.00426, 2023. 1

  13. [13]

    Foundhand: Large- scale domain-specific learning for controllable hand image generation

    Kefan Chen, Chaerin Min, Linguang Zhang, Shreyas Ham- pali, Cem Keskin, and Srinath Sridhar. Foundhand: Large- scale domain-specific learning for controllable hand image generation. arXiv preprint arXiv:2412.02690, 2024. 2

  14. [14]

    Multi-subject open-set personalization in video generation

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. arXiv preprint arXiv:2501.06187, 2025. 2

  15. [15]

    Dreamcinema: Cinematic transfer with free camera and 3d character

    Weiliang Chen, Fangfu Liu, Diankun Wu, Haowen Sun, Haixu Song, and Yueqi Duan. Dreamcinema: Cinematic transfer with free camera and 3d character. arXiv preprint arXiv:2408.12601, 2024. 2

  16. [16]

    Style in- jection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer

    Jiwoo Chung, Sangeek Hyun, and Jae-Pil Heo. Style in- jection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8795–8805, 2024. 2

  17. [17]

    Laserhuman: language-guided scene- aware human motion generation in free environment.arXiv preprint arXiv:2403.13307, 2024

    Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, and Yuexin Ma. Laserhuman: language-guided scene- aware human motion generation in free environment.arXiv preprint arXiv:2403.13307, 2024. 2, 3

  18. [18]

    Dragvideo: Interactive drag-style video editing

    Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, and Chi-Keung Tang. Dragvideo: Interactive drag-style video editing. In European Conference on Computer Vi- sion, pages 183–199. Springer, 2024. 3

  19. [19]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning,

  20. [20]

    Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, 2016. 6

  21. [21]

    Motioncharacter: Identity-preserving and motion controllable human video generation

    Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, and He Tang. Motioncharacter: Identity-preserving and motion controllable human video generation. ArXiv, abs/2411.18281, 2024. 2

  22. [22]

    Dreamoving: A human 9 video generation framework based on diffusion models

    Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, and Xuansong Xie. Dreamoving: A human 9 video generation framework based on diffusion models. ArXiv, abs/2312.05107, 2023. 2

  23. [23]

    Hu- mandit: Pose-guided diffusion transformer for long-form human motion video generation

    Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xi- ang Yin, Zehuan Yuan, Bingyue Peng, and Jianke Zhu. Hu- mandit: Pose-guided diffusion transformer for long-form human motion video generation. 2025. 2

  24. [24]

    Preserve your own cor- relation: A noise prior for video diffusion models

    Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, An- drew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own cor- relation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023. 2

  25. [25]

    Diffusion as shader: 3d-aware video diffu- sion for versatile video generation control

    Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, et al. Diffusion as shader: 3d-aware video diffu- sion for versatile video generation control. arXiv preprint arXiv:2501.03847, 2025. 2

  26. [26]

    I2v-adapter: A general image-to-video adapter for diffusion models

    Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, et al. I2v-adapter: A general image-to-video adapter for diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12, 2024. 2

  27. [27]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Y . Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without spe- cific tuning. ArXiv, abs/2307.04725, 2023. 2

  28. [28]

    Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors

    Vladimir Guzov, Aymen Mir, Torsten Sattler, and Gerard Pons-Moll. Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 4318–4329, 2021. 3

  29. [29]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 ,

  30. [30]

    Resolving 3d human pose ambiguities with 3d scene constraints

    Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3d human pose ambiguities with 3d scene constraints. In Proceedings of the IEEE/CVF international conference on computer vision , pages 2282– 2292, 2019. 2, 3

  31. [31]

    Cameractrl: En- abling camera control for text-to-video generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: En- abling camera control for text-to-video generation. In The Thirteenth International Conference on Learning Repre- sentations, 2025. 2

  32. [32]

    Denoising dif- fusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. Advances in neural informa- tion processing systems, 33:6840–6851, 2020. 2

  33. [33]

    Video dif- fusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 2

  34. [34]

    Storyagent: Cus- tomized storytelling video generation via multi-agent col- laboration

    Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang. Storyagent: Cus- tomized storytelling video generation via multi-agent col- laboration. ArXiv, abs/2411.04925, 2024. 2

  35. [35]

    Move-in-2d: 2d-conditioned human motion generation

    Hsin-Ping Huang, Yang Zhou, Jui-Hsien Wang, Difan Liu, Feng Liu, Ming-Hsuan Yang, and Zhan Xu. Move-in-2d: 2d-conditioned human motion generation. arXiv preprint arXiv:2412.13185, 2024. 3

  36. [36]

    Diffusion- based generation, optimization, and planning in 3d scenes

    Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. Diffusion- based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16750–16761, 2023. 3, 4

  37. [37]

    Owl- 1: Omni world model for consistent long video generation

    Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, and Jiwen Lu. Owl- 1: Omni world model for consistent long video generation. ArXiv, abs/2412.09600, 2024. 2

  38. [38]

    VBench: Comprehensive benchmark suite for video generative mod- els

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative mod- els. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Reco...

  39. [39]

    Peekaboo: Interactive video generation via masked- diffusion

    Yash Jain, Anshul Nasery, Vibhav Vineet, and Harkirat Behl. Peekaboo: Interactive video generation via masked- diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8079– 8088, 2024. 3

  40. [40]

    Scaling up dynamic human-scene interaction mod- eling

    Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction mod- eling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 1737– 1747, 2024. 2, 3

  41. [41]

    Story-adapter: A training-free iterative framework for long story visual- ization

    Jia ju Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, and Yuyin Zhou. Story-adapter: A training-free iterative framework for long story visual- ization. ArXiv, abs/2410.06244, 2024. 2

  42. [42]

    Zhanghan Ke, Chunyi Sun, Lei Zhu, Ke Xu, and Ryn- son W.H. Lau. Harmonizer: Learning to perform white-box image and video harmonization. In European Conference on Computer Vision (ECCV), 2022. 14

  43. [43]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42:1 – 14, 2023. 6

  44. [44]

    David: Modeling dynamic affordance of 3d objects using pre- trained video diffusion models

    Hyeonwoo Kim, Sangwon Beak, and Hanbyul Joo. David: Modeling dynamic affordance of 3d objects using pre- trained video diffusion models. ArXiv, abs/2501.08333,

  45. [45]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023. 3, 5

  46. [46]

    Putting people in their place: Affordance-aware hu- man insertion into scenes

    Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A Efros, and Krishna Kumar Singh. Putting people in their place: Affordance-aware hu- man insertion into scenes. In Proceedings of the IEEE/CVF 10 Conference on Computer Vision and Pattern Recognition , pages 17089–17099, 2023. 3

  47. [47]

    Black Forest Labs. Flux. https://github.com/ black-forest-labs/flux . Accessed: 2024-09-24. 1

  48. [48]

    Ze- rohsi: Zero-shot 4d human-scene interaction by video gen- eration

    Hongjie Li, Hong-Xing Yu, Jiaman Li, and Jiajun Wu. Ze- rohsi: Zero-shot 4d human-scene interaction by video gen- eration. ArXiv, abs/2412.18600, 2024. 2

  49. [49]

    Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation

    Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3383– 3393, 2021. 5

  50. [50]

    Hybrik-x: Hybrid analytical-neural inverse kinematics for whole-body mesh recovery

    Jiefeng Li, Siyuan Bian, Chao Xu, Zhicun Chen, Lixin Yang, and Cewu Lu. Hybrik-x: Hybrid analytical-neural inverse kinematics for whole-body mesh recovery. arXiv preprint arXiv:2304.05690, 2023. 5

  51. [51]

    Genzi: Zero-shot 3d human-scene interaction generation

    Lei Li and Angela Dai. Genzi: Zero-shot 3d human-scene interaction generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 20465–20474, 2024. 2, 3, 5, 7, 14

  52. [52]

    Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling

    Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 6

  53. [53]

    Intergen: Diffusion-based multi-human motion generation under complex interactions

    Hanming Liang, Wenqian Zhang, Wenxu Li, Jingyi Yu, and Lan Xu. Intergen: Diffusion-based multi-human motion generation under complex interactions. Int. J. Comput. Vis., 132:3463–3483, 2023. 2

  54. [54]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiao wen Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, and Li Yuan. Open-sora plan: Open-source large video generation model. ArXiv, ...

  55. [55]

    Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models

    Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, and Chao Liang. Omnihuman-1: Rethinking the scaling-up of one-stage conditioned human animation models. 2025. 2

  56. [56]

    Separate motion from appear- ance: Customizing motion via customizing text-to-video diffusion models

    Huijie Liu, Jingyun Wang, Shuai Ma, Jie Hu, Xiaoming Wei, and Guoliang Kang. Separate motion from appear- ance: Customizing motion via customizing text-to-video diffusion models. arXiv preprint arXiv:2501.16714, 2025. 2

  57. [57]

    Phantom: Subject- consistent video generation via cross-modal alignment

    Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Ji- awei Liu, Qian He, and Xinglong Wu. Phantom: Subject- consistent video generation via cross-modal alignment

  58. [58]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision , pages 38–55. Springer, 2024. 3

  59. [59]

    Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, 2015. 2, 3

  60. [60]

    Smpl: A skinned multi-person linear model

    Matthew Loper, Naureen Mahmood, Javier Romero, Ger- ard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2 , pages 851–866. 2023. 4

  61. [61]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foun- dation model. arXiv preprint arXiv:2502.10248, 2025. 2

  62. [62]

    Trailblazer: Trajectory control for diffusion-based video generation

    Wan-Duo Kurt Ma, John P Lewis, and W Bastiaan Kleijn. Trailblazer: Trajectory control for diffusion-based video generation. In SIGGRAPH Asia 2024 Conference Papers , pages 1–11, 2024. 3

  63. [63]

    Cinemo: Consis- tent and controllable image animation with motion diffu- sion models

    Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Yuan- Fang Li, Cunjian Chen, and Yu Qiao. Cinemo: Consis- tent and controllable image animation with motion diffu- sion models. arXiv preprint arXiv:2407.15642, 2024. 2

  64. [64]

    Genheld: Generating and editing handheld objects

    Chaerin Min and Srinath Sridhar. Genheld: Generating and editing handheld objects. arXiv preprint arXiv:2406.05059,

  65. [65]

    Chatgpt-4o, 2025

    OpenAI. Chatgpt-4o, 2025. Accessed: 2025-03-08. 2, 4

  66. [66]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervi- sion. arXiv preprint arXiv:2304.07193, 2023. 14

  67. [67]

    Text2place: Affordance-aware text guided human placement

    Rishubh Parihar, Harsh Gupta, Sachidanand VS, and R Venkatesh Babu. Text2place: Affordance-aware text guided human placement. In European Conference on Computer Vision, pages 57–77. Springer, 2024. 2, 3

  68. [68]

    Expressive body capture: 3d hands, face, and body from a single image

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10975–10985, 2019. 4

  69. [69]

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019. 2, 3, 14

  70. [70]

    Hoi-diff: Text-driven syn- thesis of 3d human-object interactions using diffusion mod- els

    Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, and Huaizu Jiang. Hoi-diff: Text-driven syn- thesis of 3d human-object interactions using diffusion mod- els. ArXiv, abs/2312.06553, 2023. 2

  71. [71]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1

  72. [72]

    Dreambooth: Fine 11 tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine 11 tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500– 22510, 2023. 2

  73. [73]

    Magic insert: Style-aware drag-and-drop

    Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa, Yael Pritch, Michael Rubinstein, David E Jacobs, and Shlomi Fruchter. Magic insert: Style-aware drag-and-drop. arXiv preprint arXiv:2407.02489, 2024. 3

  74. [74]

    Geodiffuser: Geometry-based im- age editing with diffusion models

    Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil Katyal, and Srinath Sridhar. Geodiffuser: Geometry-based im- age editing with diffusion models. arXiv preprint arXiv:2404.14403, 2024. 2

  75. [75]

    Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Da, and Hong- sheng Li

    Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Y . Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Da, and Hong- sheng Li. Motion-i2v: Consistent and controllable image- to-video generation with explicit motion modeling. ArXiv, abs/2401.15977, 2024. 2

  76. [76]

    Dragdiffusion: Harnessing diffusion models for interac- tive point-based image editing

    Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Han- shu Yan, Wenqing Zhang, Vincent YF Tan, and Song Bai. Dragdiffusion: Harnessing diffusion models for interac- tive point-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8839–8849, 2024. 2

  77. [77]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. 2

  78. [78]

    Sound to visual scene gener- ation by audio-to-visual latent alignment

    Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, and Tae-Hyun Oh. Sound to visual scene gener- ation by audio-to-visual latent alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 6430–6440, 2023. 2

  79. [79]

    Add-it: Training-free object in- sertion in images with pretrained diffusion models

    Yoad Tewel, Rinon Gal, Dvir Samuel, Yuval Atzmon, Lior Wolf, and Gal Chechik. Add-it: Training-free object in- sertion in images with pretrained diffusion models. arXiv preprint arXiv:2411.07232, 2024. 3

  80. [80]

    Lingen: Towards high-resolution minute- length text-to-video generation with linear computational complexity

    Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj Kumar Jha, and Xiaoliang Dai. Lingen: Towards high-resolution minute- length text-to-video generation with linear computational complexity. ArXiv, abs/2412.09856, 2024. 2

Showing first 80 references.