pith. machine review for the scientific record. sign in

arxiv: 2605.07402 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

InsHuman: Towards Natural and Identity-Preserving Human Insertion

Jian Chen, Jie Li, Shulian Zhang, Wenbo Li, Yangyang Gao, Yong Guo, Yulun Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords human insertionimage editingidentity preservationforeground detectionface recognitiondiffusion modelsdataset construction
0
0 comments X

The pith

InsHuman inserts specific people into new backgrounds while preserving their identity and making the placement look natural.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix common failures in human insertion tasks where models change a person's face, put them in the wrong pose, or add or remove people when placing them into a different scene. It introduces three targeted pieces: a fusion step that weights human and background regions differently using masks, a face-matching step that locks identity through recognition features, and a data-pairing method that builds training examples with realistic interactions. A sympathetic reader would care because reliable insertion opens practical uses in photo editing, virtual staging, and content creation without distorting who the person is.

Core claim

The authors claim that Human-Background Adaptive Fusion detects foreground humans to create binary masks and applies region-aware weighting so that predicted and ground-truth latents align on pose, person count, and overall appearance; that Face-to-Face ID-Preserving detects and matches faces between output and source using recognition features to keep identity fixed; and that Bidirectional Data Pairing creates a dataset of high-quality human-background pairs. Together these produce plausible insertions that leave identity unchanged.

What carries the argument

Human-Background Adaptive Fusion (HBAF) uses foreground-detection masks and region-aware weighting to force coherent adaptation of human pose, count, and appearance to the background; Face-to-Face ID-Preserving (FFIP) enforces identity by matching face-recognition features between generated and source images.

Load-bearing premise

The approach assumes that binary masks from foreground detection and face-recognition features can guide insertion without creating new pose or identity inconsistencies in varied scenes.

What would settle it

Run the model on a set of source humans and target backgrounds where the correct physical placement requires a changed pose or viewpoint; if the outputs still show mismatched poses or altered faces under side-by-side identity checks, the claim is falsified.

Figures

Figures reproduced from arXiv: 2605.07402 by Jian Chen, Jie Li, Shulian Zhang, Wenbo Li, Yangyang Gao, Yong Guo, Yulun Zhang.

Figure 1
Figure 1. Figure 1: Qualitative comparison of different image editing models on human insertion task. Existing image editing models often exhibit issues e.g., poses failing to adapt to the background, deviations in the number of people or appearance from the reference image, and loss of facial features. In contrast, models fine-tuned with our method (InsHuman) more stably maintain human structure and identity features, achiev… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our proposed InsHuman. (a) Human-Background Adaptive Fusion (HBAF): Utilizing the human region mask to dynamically adjust the weight of the human region. (b) Face-to￾Face ID-Preserving (FFIP): Activated when t ≤ Tend, it employs a pretrained facial feature extraction network and matching algorithm to perform facial feature alignment without interfering with the overall person structure. 𝐼𝑏𝑔 𝐼𝑠𝑟… view at source ↗
Figure 3
Figure 3. Figure 3: Weighted supervision for human region. This mechanism uses a human region mask to guide the model in prioritizing the optimization of human structure and overall appearance. 3 Natural and Identity-Preserving Human Insertion In the following, we propose three components to address these challenges: HBAF for overall human structural coherence (Section 3.1), FFIP for facial identity preservation (Section 3.2)… view at source ↗
Figure 4
Figure 4. Figure 4: Variation of model-predicted images over timesteps. In the early stages of denoising, the model primarily determines the number of people and the overall composition. In the later stages of denoising, it mainly refines details and textures. the foreground person region are set to 1. As shown in [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The construction process of BDP-InsHuman. Through (a) forward pairing and (b) reverse pairing, real-world images can serve as both ground truth and input to the model. The total loss is defined as Ltotal = LHBAF + λfaceLF F IP , where λface is a balancing coefficient that keeps the two loss terms at a similar order of magnitude. This design guides the model to preserve facial identity consistent with the s… view at source ↗
Figure 6
Figure 6. Figure 6: Visual comparison without HBAF. Removing HBAF causes spatial mismatch, limb distortions, and incorrect person counts. 𝐼𝑠𝑟𝑐 w/o FFIP Ours 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 Ours 𝐼𝑠𝑟𝑐 w/o FFIP Ours 𝐼 𝐼𝑠𝑟𝑐 𝑏𝑔 Ours [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Ablation study results of the bidirectional construction strategy. A comparison among forward-only construction, reverse-only construction, and bidirectional construction validates the superiority of bidirectional construction strategy. 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 Fix λ=2.5 Ours Ours 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 Fix λ=2.5 Ours 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 Ours 𝐼𝑠𝑟𝑐 Fix λ=2.5 Ours Ours [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visual comparison with a fixed λ. Removal of the dynamic weight function λ(t) disrupts facial feature generation. Cross Model Generalization of InsHuman. To validate the generalization of our method, we apply InsHuman to FLUX.2 under the same settings. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visual comparison with λmax = 5. A large λmax will undermine the overall structure of the person. 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 FLUX.2 FLUX.2 Ours FLUX.2 Ours Ours [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visual comparison of FLUX.2 before and after applying InsHuman. After fine-tuning with InsHuman, the model can adaptively adjust the person’s position and scale, while keeping the number of people consistent with the reference image [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: More qualitative comparison of different image editing models on human insertion. Existing image editing models often exhibit issues on human insertion, e.g., poses failing to adapt to the background, deviations in the number of people or overall appearance from the reference image, and loss of facial features. In contrast, our method (InsHuman) more stably maintain human structure and identity features, … view at source ↗
read the original abstract

Human insertion aims to naturally place specific individuals into a target background. Although existing image editing models may have such ability, they often produce failure cases, including inappropriate human pose in new background, inconsistent number of people, and modified facial identity. Moreover, publicly available human datasets often lack full-body portraits and realistic physical interaction between humans and their background. To address these challenges, we propose InsHuman for natural and identity-preserving human insertion. Specifically, we propose Human-Background Adaptive Fusion (HBAF), which detects foreground humans to obtain a binary mask and applies region-aware weighting to align the human regions between predicted and ground-truth latents, ensuring the person's pose, count, and overall appearance are coherently adapted to the target background.We further propose Face-to-Face ID-Preserving (FFIP), which detects and matches faces between the generated image and the source image in terms of face recognition features to enforce identity consistency for each face.In addition, we propose Bidirectional Data Pairing (BDP) strategy to construct BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments demonstrate that InsHuman achieves significant improvements in generating plausible images while keeping human identity unchanged.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces InsHuman, a framework for inserting specific individuals into target backgrounds while preserving natural pose, count, appearance, and facial identity. It proposes three components: Human-Background Adaptive Fusion (HBAF), which obtains binary masks via foreground detection and applies region-aware weighting to align human regions between predicted and ground-truth latents; Face-to-Face ID-Preserving (FFIP), which matches face recognition features between generated and source images; and Bidirectional Data Pairing (BDP), which constructs the BDP-InsHuman dataset with realistic human-background interactions. The central claim is that these yield significant improvements over existing image editing models in generating plausible results without identity changes.

Significance. If the empirical claims hold with proper validation, the work could advance controllable human-centric image synthesis by targeting specific failure modes (pose inconsistency, count errors, identity drift) that current models exhibit. The dataset construction via BDP is a constructive contribution if the data and code are released. However, the absence of any quantitative metrics, baselines, or ablation details in the description substantially weakens the ability to evaluate impact or reproducibility.

major comments (3)
  1. [Abstract] Abstract: the claim that 'Experiments demonstrate that InsHuman achieves significant improvements' is unsupported because no quantitative metrics (e.g., FID, LPIPS, face similarity scores), baseline comparisons, ablation studies, or evaluation protocol details are supplied anywhere in the manuscript text.
  2. [Method (HBAF)] Human-Background Adaptive Fusion (HBAF) section: the region-aware weighting relies on accurate binary masks from off-the-shelf foreground detection; no robustness analysis is provided for occlusion, crowd overlap, or lighting variation, which directly risks breaking the claimed pose/count coherence if masks are noisy.
  3. [Experiments] Experiments section: the manuscript asserts improvements in 'plausible images while keeping human identity unchanged' but supplies neither the datasets used for testing, the specific metrics for identity preservation, nor comparisons to prior editing models, rendering the central claim unverifiable.
minor comments (2)
  1. [Abstract] Abstract: the phrasing 'modified facial identity' is vague; clarify whether this refers to identity drift or attribute changes.
  2. [Method] Notation: 'region-aware weighting' and 'face recognition features' are introduced without equations or pseudocode, making the precise implementation of HBAF and FFIP hard to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We acknowledge the gaps in quantitative evaluation and experimental rigor in the submitted manuscript and will undertake a major revision to address them fully. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'Experiments demonstrate that InsHuman achieves significant improvements' is unsupported because no quantitative metrics (e.g., FID, LPIPS, face similarity scores), baseline comparisons, ablation studies, or evaluation protocol details are supplied anywhere in the manuscript text.

    Authors: We agree that the abstract claim is currently unsupported. The submitted manuscript lacks these details in the Experiments section. In the revised version we will add quantitative results using FID, LPIPS, and face similarity scores, include baseline comparisons, ablation studies on HBAF/FFIP/BDP, and describe the evaluation protocol and test datasets. The abstract will be updated accordingly to reflect the new content. revision: yes

  2. Referee: [Method (HBAF)] Human-Background Adaptive Fusion (HBAF) section: the region-aware weighting relies on accurate binary masks from off-the-shelf foreground detection; no robustness analysis is provided for occlusion, crowd overlap, or lighting variation, which directly risks breaking the claimed pose/count coherence if masks are noisy.

    Authors: We accept this criticism. The current manuscript provides no robustness analysis. We will add dedicated experiments and discussion in the revision, covering performance under occlusion, crowd overlap, and lighting changes, along with analysis of noisy mask effects and any mitigation approaches. revision: yes

  3. Referee: [Experiments] Experiments section: the manuscript asserts improvements in 'plausible images while keeping human identity unchanged' but supplies neither the datasets used for testing, the specific metrics for identity preservation, nor comparisons to prior editing models, rendering the central claim unverifiable.

    Authors: This observation is correct. The Experiments section in the submission is incomplete on these points. We will expand it to specify the test datasets (including BDP-InsHuman and any public benchmarks), detail identity metrics such as face embedding cosine similarity, and present both quantitative and qualitative comparisons against prior image editing models. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

full rationale

The paper presents InsHuman as an additive method combining three proposed components (HBAF for region-aware fusion via binary masks, FFIP for face-feature identity enforcement, and BDP for dataset construction) on top of prior image-editing models. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. Central claims rest on experimental demonstrations of improved plausibility and identity preservation rather than any reduction of outputs to inputs by construction. Any self-citations (not visible in the abstract) are not load-bearing for the method's validity, which is presented as empirical. This is the standard case of a non-circular applied CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim depends on the effectiveness of three newly introduced components whose internal mechanisms and training details are not specified in the abstract; no free parameters are named, but the approach rests on standard computer vision assumptions about detection reliability.

axioms (2)
  • domain assumption Foreground human detection produces reliable binary masks suitable for region-aware weighting
    Invoked in the description of HBAF without further justification.
  • domain assumption Face recognition features can enforce identity consistency without side effects on pose or background
    Central to FFIP as described.
invented entities (3)
  • Human-Background Adaptive Fusion (HBAF) no independent evidence
    purpose: Detects foreground humans and applies region-aware weighting to align poses and appearance
    New module proposed to solve pose and count inconsistencies
  • Face-to-Face ID-Preserving (FFIP) no independent evidence
    purpose: Matches faces between generated and source images using recognition features
    New component to preserve facial identity
  • Bidirectional Data Pairing (BDP) no independent evidence
    purpose: Constructs BDP-InsHuman dataset with realistic human-background interactions
    New data strategy to address dataset limitations

pith-pipeline@v0.9.0 · 5522 in / 1574 out tokens · 60743 ms · 2026-05-11T02:05:29.126443+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

  1. [1]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  2. [2]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4199–4209, 2023

  3. [3]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  4. [4]

    Prompt-to- prompt image editing with cross attention control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to- prompt image editing with cross attention control. InInternational Conference on Learning Representations, 2023

  5. [5]

    Black Forest Labs. FLUX.2. Online.https://blackforestlabs.ai, 2025. Accessed: 2025-05-07

  6. [6]

    arXiv preprint arXiv:2510.06679 (2025)

    Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. Dreamomni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679, 2025

  7. [7]

    Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

  8. [8]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

  9. [9]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  10. [10]

    Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

    Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021

  11. [11]

    Paint by example: Exemplar-based image editing with diffusion models

    Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023

  12. [12]

    Plug-and-play diffusion features for text- driven image-to-image translation

    Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text- driven image-to-image translation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023

  13. [13]

    Anydoor: Zero-shot object-level image customization

    Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6418–6427, 2024

  14. [14]

    Objectstitch: Object compositing with diffusion models

    Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Seung Yong Kim, and Daniel Ali. Objectstitch: Object compositing with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18310–18319, 2023

  15. [15]

    Tf-icon: Diffusion-based training-free cross-domain image composition

    Shilin Lu, Yanzhu Liu, and Hao-Wei Adams. Tf-icon: Diffusion-based training-free cross-domain image composition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2294–2305, 2023

  16. [16]

    Putting people in their place: Affordance-aware human insertion into scenes

    Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A Efros, and Krishna Kumar Singh. Putting people in their place: Affordance-aware human insertion into scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17089– 17099, 2023

  17. [17]

    Teleportraits: Training-free people insertion into any scene

    Jialu Gao, KJ Joseph, and Fernando De La Torre. Teleportraits: Training-free people insertion into any scene. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18866–18875, 2025

  18. [18]

    Insert anyone: High-fidelity full-body photo insertion via dual-branch adapters.Expert Systems with Applications, page 131013, 2026

    Yifan Zhang, Jianguo Wang, Zhongliang Tang, and Wenmin Wang. Insert anyone: High-fidelity full-body photo insertion via dual-branch adapters.Expert Systems with Applications, page 131013, 2026. 10

  19. [19]

    Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023

  20. [20]

    An image is worth one word: Personalizing text-to-image generation using textual inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In International Conference on Learning Representations, 2023

  21. [21]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023

  22. [22]

    T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohui Qie. T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024

  23. [23]

    Humansd: A large-scale dataset and baseline for human-centric text-to-image generation

    Xuan Ju, Ailing Zeng, Chenjian Wang, Jianan Su, Jianing Wang, Yunsheng Li, Defeng Ding, Haiyong Zheng, Lu Qi, Anton van den Hengel, et al. Humansd: A large-scale dataset and baseline for human-centric text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22934–22945, 2023

  24. [24]

    Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models

    Hu Ye, Jun Zhang, Sibei Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6527–6536, 2024

  25. [25]

    Instantid: Zero-shot identity- preserving generation in seconds

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity- preserving generation in seconds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  26. [26]

    Photomaker: Customizing realistic human photos via stacked id embedding

    Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024

  27. [27]

    Fastcomposer: Tuning- free multi-subject image generation with localized attention.arXiv preprint arXiv:2305.10431, 2023

    Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning- free multi-subject image generation with localized attention.arXiv preprint arXiv:2305.10431, 2023

  28. [28]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

  29. [29]

    Vggface2: A dataset for recognising faces across pose and age

    Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018

  30. [30]

    Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

    Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016

  31. [31]

    Crowdhuman: A benchmark for detecting human in a crowd,

    Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd.arXiv preprint arXiv:1805.00123, 2018

  32. [32]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023

  33. [33]

    Blended diffusion for text-driven editing of natural images

    Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022

  34. [34]

    Diffedit: Diffusion-based semantic image editing with mask guidance

    Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. InInternational Conference on Learning Representations, 2023

  35. [35]

    Magicanimate: Temporally consistent human image animation using diffusion model

    Jianhan Xu, Ke Xiao, Yiran Zhao, et al. Magicanimate: Temporally consistent human image animation using diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22744–22753, 2024. 11

  36. [36]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023

  37. [37]

    Celebv-hq: A large-scale video facial attributes dataset

    Hao Zhu, Wu Wayne, Wentao Qiu, Chenxia Zhu, et al. Celebv-hq: A large-scale video facial attributes dataset. InEuropean conference on computer vision, pages 650–667. Springer, 2022

  38. [38]

    Stylegan-human: A data-centric odyssey of human generation

    Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. Stylegan-human: A data-centric odyssey of human generation. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2022

  39. [39]

    Text2human: Text-driven controllable human image generation.ACM Transactions on Graphics (TOG), 41(4):1–11, 2022

    Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2human: Text-driven controllable human image generation.ACM Transactions on Graphics (TOG), 41(4):1–11, 2022

  40. [40]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  41. [41]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

    Kaiyi Huang, Peize Sun, Jianing Hou, et al. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2556–2566, 2023

  42. [42]

    Editbench: Image editing evaluation dataset

    Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Peliti, Richard Baird, and David J Fleet. Editbench: Image editing evaluation dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14545–14554, 2023

  43. [43]

    YOLOv12: Attention-Centric Real-Time Object Detectors

    Yunjie Tian, Qixiang Ye, and David Doermann. Yolov12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524, 2025

  44. [44]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

  45. [45]

    Attend-and-excite: Attention- based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

    Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention- based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023. 12 A Implementation details and Metrics Model Training.We useBDP-InsHumanobtained in Section 3.3 (consisting of 529 high-quality data pairs) t...