arxiv: 2605.07402 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

InsHuman: Towards Natural and Identity-Preserving Human Insertion

Jian Chen, Jie Li, Shulian Zhang, Wenbo Li, Yangyang Gao, Yong Guo, Yulun Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords human insertionimage editingidentity preservationforeground detectionface recognitiondiffusion modelsdataset construction

0 comments

The pith

InsHuman inserts specific people into new backgrounds while preserving their identity and making the placement look natural.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix common failures in human insertion tasks where models change a person's face, put them in the wrong pose, or add or remove people when placing them into a different scene. It introduces three targeted pieces: a fusion step that weights human and background regions differently using masks, a face-matching step that locks identity through recognition features, and a data-pairing method that builds training examples with realistic interactions. A sympathetic reader would care because reliable insertion opens practical uses in photo editing, virtual staging, and content creation without distorting who the person is.

Core claim

The authors claim that Human-Background Adaptive Fusion detects foreground humans to create binary masks and applies region-aware weighting so that predicted and ground-truth latents align on pose, person count, and overall appearance; that Face-to-Face ID-Preserving detects and matches faces between output and source using recognition features to keep identity fixed; and that Bidirectional Data Pairing creates a dataset of high-quality human-background pairs. Together these produce plausible insertions that leave identity unchanged.

What carries the argument

Human-Background Adaptive Fusion (HBAF) uses foreground-detection masks and region-aware weighting to force coherent adaptation of human pose, count, and appearance to the background; Face-to-Face ID-Preserving (FFIP) enforces identity by matching face-recognition features between generated and source images.

Load-bearing premise

The approach assumes that binary masks from foreground detection and face-recognition features can guide insertion without creating new pose or identity inconsistencies in varied scenes.

What would settle it

Run the model on a set of source humans and target backgrounds where the correct physical placement requires a changed pose or viewpoint; if the outputs still show mismatched poses or altered faces under side-by-side identity checks, the claim is falsified.

Figures

Figures reproduced from arXiv: 2605.07402 by Jian Chen, Jie Li, Shulian Zhang, Wenbo Li, Yangyang Gao, Yong Guo, Yulun Zhang.

**Figure 1.** Figure 1: Qualitative comparison of different image editing models on human insertion task. Existing image editing models often exhibit issues e.g., poses failing to adapt to the background, deviations in the number of people or appearance from the reference image, and loss of facial features. In contrast, models fine-tuned with our method (InsHuman) more stably maintain human structure and identity features, achiev… view at source ↗

**Figure 2.** Figure 2: Overview of our proposed InsHuman. (a) Human-Background Adaptive Fusion (HBAF): Utilizing the human region mask to dynamically adjust the weight of the human region. (b) Face-toFace ID-Preserving (FFIP): Activated when t ≤ Tend, it employs a pretrained facial feature extraction network and matching algorithm to perform facial feature alignment without interfering with the overall person structure. 𝐼𝑏𝑔 𝐼𝑠𝑟… view at source ↗

**Figure 3.** Figure 3: Weighted supervision for human region. This mechanism uses a human region mask to guide the model in prioritizing the optimization of human structure and overall appearance. 3 Natural and Identity-Preserving Human Insertion In the following, we propose three components to address these challenges: HBAF for overall human structural coherence (Section 3.1), FFIP for facial identity preservation (Section 3.2)… view at source ↗

**Figure 4.** Figure 4: Variation of model-predicted images over timesteps. In the early stages of denoising, the model primarily determines the number of people and the overall composition. In the later stages of denoising, it mainly refines details and textures. the foreground person region are set to 1. As shown in [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: The construction process of BDP-InsHuman. Through (a) forward pairing and (b) reverse pairing, real-world images can serve as both ground truth and input to the model. The total loss is defined as Ltotal = LHBAF + λfaceLF F IP , where λface is a balancing coefficient that keeps the two loss terms at a similar order of magnitude. This design guides the model to preserve facial identity consistent with the s… view at source ↗

**Figure 6.** Figure 6: Visual comparison without HBAF. Removing HBAF causes spatial mismatch, limb distortions, and incorrect person counts. 𝐼𝑠𝑟𝑐 w/o FFIP Ours 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 Ours 𝐼𝑠𝑟𝑐 w/o FFIP Ours 𝐼 𝐼𝑠𝑟𝑐 𝑏𝑔 Ours [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 8.** Figure 8: Ablation study results of the bidirectional construction strategy. A comparison among forward-only construction, reverse-only construction, and bidirectional construction validates the superiority of bidirectional construction strategy. 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 Fix λ=2.5 Ours Ours 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 Fix λ=2.5 Ours 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 Ours 𝐼𝑠𝑟𝑐 Fix λ=2.5 Ours Ours [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Visual comparison with a fixed λ. Removal of the dynamic weight function λ(t) disrupts facial feature generation. Cross Model Generalization of InsHuman. To validate the generalization of our method, we apply InsHuman to FLUX.2 under the same settings. As shown in [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Visual comparison with λmax = 5. A large λmax will undermine the overall structure of the person. 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 𝐼𝑠𝑟𝑐 𝐼𝑏𝑔 FLUX.2 FLUX.2 Ours FLUX.2 Ours Ours [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 11.** Figure 11: Visual comparison of FLUX.2 before and after applying InsHuman. After fine-tuning with InsHuman, the model can adaptively adjust the person’s position and scale, while keeping the number of people consistent with the reference image [PITH_FULL_IMAGE:figures/full_fig_p009_11.png] view at source ↗

**Figure 12.** Figure 12: More qualitative comparison of different image editing models on human insertion. Existing image editing models often exhibit issues on human insertion, e.g., poses failing to adapt to the background, deviations in the number of people or overall appearance from the reference image, and loss of facial features. In contrast, our method (InsHuman) more stably maintain human structure and identity features, … view at source ↗

read the original abstract

Human insertion aims to naturally place specific individuals into a target background. Although existing image editing models may have such ability, they often produce failure cases, including inappropriate human pose in new background, inconsistent number of people, and modified facial identity. Moreover, publicly available human datasets often lack full-body portraits and realistic physical interaction between humans and their background. To address these challenges, we propose InsHuman for natural and identity-preserving human insertion. Specifically, we propose Human-Background Adaptive Fusion (HBAF), which detects foreground humans to obtain a binary mask and applies region-aware weighting to align the human regions between predicted and ground-truth latents, ensuring the person's pose, count, and overall appearance are coherently adapted to the target background.We further propose Face-to-Face ID-Preserving (FFIP), which detects and matches faces between the generated image and the source image in terms of face recognition features to enforce identity consistency for each face.In addition, we propose Bidirectional Data Pairing (BDP) strategy to construct BDP-InsHuman, a high-quality dataset with realistic human-background interactions. Experiments demonstrate that InsHuman achieves significant improvements in generating plausible images while keeping human identity unchanged.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

InsHuman adds three targeted modules to fix pose, count, and identity failures in human insertion but supplies no metrics or robustness checks in the provided description.

read the letter

The paper identifies three recurring failure modes when dropping specific people into new backgrounds—wrong pose or count, and altered faces—and proposes HBAF, FFIP, and BDP to handle them. HBAF pulls a binary mask from foreground detection and uses region-aware weighting on latents to keep the inserted human coherent with the scene. FFIP runs face recognition to match features between source and output so identity stays fixed. BDP creates paired training data that includes realistic human-background contact, which existing datasets miss. These are straightforward extensions of existing segmentation and recognition tools, applied to a narrow but common editing task. The dataset construction step is the part that feels most reusable for others working on similar synthesis problems. The description stays at the level of component names and high-level goals. No quantitative results, baseline comparisons, or ablation numbers appear, so it is impossible to tell whether the claimed improvements are large, small, or mainly from extra training data. The mask step in HBAF is also a clear point of fragility: if foreground detection slips on occlusion, crowds, or unusual lighting, the region weighting cannot enforce the pose and count consistency the paper wants. Nothing in the text shows that the detectors were tested under those conditions. This work is aimed at people already building or fine-tuning image-editing pipelines who need quick fixes for human-specific artifacts. A practitioner might borrow the data-pairing idea or the face-matching loss without much trouble. It is coherent enough on its own terms to go to referees, though any review would have to focus on whether the experiments actually demonstrate gains once the numbers are shown.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces InsHuman, a framework for inserting specific individuals into target backgrounds while preserving natural pose, count, appearance, and facial identity. It proposes three components: Human-Background Adaptive Fusion (HBAF), which obtains binary masks via foreground detection and applies region-aware weighting to align human regions between predicted and ground-truth latents; Face-to-Face ID-Preserving (FFIP), which matches face recognition features between generated and source images; and Bidirectional Data Pairing (BDP), which constructs the BDP-InsHuman dataset with realistic human-background interactions. The central claim is that these yield significant improvements over existing image editing models in generating plausible results without identity changes.

Significance. If the empirical claims hold with proper validation, the work could advance controllable human-centric image synthesis by targeting specific failure modes (pose inconsistency, count errors, identity drift) that current models exhibit. The dataset construction via BDP is a constructive contribution if the data and code are released. However, the absence of any quantitative metrics, baselines, or ablation details in the description substantially weakens the ability to evaluate impact or reproducibility.

major comments (3)

[Abstract] Abstract: the claim that 'Experiments demonstrate that InsHuman achieves significant improvements' is unsupported because no quantitative metrics (e.g., FID, LPIPS, face similarity scores), baseline comparisons, ablation studies, or evaluation protocol details are supplied anywhere in the manuscript text.
[Method (HBAF)] Human-Background Adaptive Fusion (HBAF) section: the region-aware weighting relies on accurate binary masks from off-the-shelf foreground detection; no robustness analysis is provided for occlusion, crowd overlap, or lighting variation, which directly risks breaking the claimed pose/count coherence if masks are noisy.
[Experiments] Experiments section: the manuscript asserts improvements in 'plausible images while keeping human identity unchanged' but supplies neither the datasets used for testing, the specific metrics for identity preservation, nor comparisons to prior editing models, rendering the central claim unverifiable.

minor comments (2)

[Abstract] Abstract: the phrasing 'modified facial identity' is vague; clarify whether this refers to identity drift or attribute changes.
[Method] Notation: 'region-aware weighting' and 'face recognition features' are introduced without equations or pseudocode, making the precise implementation of HBAF and FFIP hard to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We acknowledge the gaps in quantitative evaluation and experimental rigor in the submitted manuscript and will undertake a major revision to address them fully. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'Experiments demonstrate that InsHuman achieves significant improvements' is unsupported because no quantitative metrics (e.g., FID, LPIPS, face similarity scores), baseline comparisons, ablation studies, or evaluation protocol details are supplied anywhere in the manuscript text.

Authors: We agree that the abstract claim is currently unsupported. The submitted manuscript lacks these details in the Experiments section. In the revised version we will add quantitative results using FID, LPIPS, and face similarity scores, include baseline comparisons, ablation studies on HBAF/FFIP/BDP, and describe the evaluation protocol and test datasets. The abstract will be updated accordingly to reflect the new content. revision: yes
Referee: [Method (HBAF)] Human-Background Adaptive Fusion (HBAF) section: the region-aware weighting relies on accurate binary masks from off-the-shelf foreground detection; no robustness analysis is provided for occlusion, crowd overlap, or lighting variation, which directly risks breaking the claimed pose/count coherence if masks are noisy.

Authors: We accept this criticism. The current manuscript provides no robustness analysis. We will add dedicated experiments and discussion in the revision, covering performance under occlusion, crowd overlap, and lighting changes, along with analysis of noisy mask effects and any mitigation approaches. revision: yes
Referee: [Experiments] Experiments section: the manuscript asserts improvements in 'plausible images while keeping human identity unchanged' but supplies neither the datasets used for testing, the specific metrics for identity preservation, nor comparisons to prior editing models, rendering the central claim unverifiable.

Authors: This observation is correct. The Experiments section in the submission is incomplete on these points. We will expand it to specify the test datasets (including BDP-InsHuman and any public benchmarks), detail identity metrics such as face embedding cosine similarity, and present both quantitative and qualitative comparisons against prior image editing models. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

full rationale

The paper presents InsHuman as an additive method combining three proposed components (HBAF for region-aware fusion via binary masks, FFIP for face-feature identity enforcement, and BDP for dataset construction) on top of prior image-editing models. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. Central claims rest on experimental demonstrations of improved plausibility and identity preservation rather than any reduction of outputs to inputs by construction. Any self-citations (not visible in the abstract) are not load-bearing for the method's validity, which is presented as empirical. This is the standard case of a non-circular applied CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim depends on the effectiveness of three newly introduced components whose internal mechanisms and training details are not specified in the abstract; no free parameters are named, but the approach rests on standard computer vision assumptions about detection reliability.

axioms (2)

domain assumption Foreground human detection produces reliable binary masks suitable for region-aware weighting
Invoked in the description of HBAF without further justification.
domain assumption Face recognition features can enforce identity consistency without side effects on pose or background
Central to FFIP as described.

invented entities (3)

Human-Background Adaptive Fusion (HBAF) no independent evidence
purpose: Detects foreground humans and applies region-aware weighting to align poses and appearance
New module proposed to solve pose and count inconsistencies
Face-to-Face ID-Preserving (FFIP) no independent evidence
purpose: Matches faces between generated and source images using recognition features
New component to preserve facial identity
Bidirectional Data Pairing (BDP) no independent evidence
purpose: Constructs BDP-InsHuman dataset with realistic human-background interactions
New data strategy to address dataset limitations

pith-pipeline@v0.9.0 · 5522 in / 1574 out tokens · 60743 ms · 2026-05-11T02:05:29.126443+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

[1]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[2]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4199–4209, 2023

work page 2023
[3]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

work page 2023
[4]

Prompt-to- prompt image editing with cross attention control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to- prompt image editing with cross attention control. InInternational Conference on Learning Representations, 2023

work page 2023
[5]

Black Forest Labs. FLUX.2. Online.https://blackforestlabs.ai, 2025. Accessed: 2025-05-07

work page 2025
[6]

arXiv preprint arXiv:2510.06679 (2025)

Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, et al. Dreamomni2: Multimodal instruction-based editing and generation. arXiv preprint arXiv:2510.06679, 2025

work page arXiv 2025
[7]

Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

work page arXiv 2025
[8]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, et al. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Viton-hd: High-resolution virtual try-on via misalignment-aware normalization

Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14131–14140, 2021

work page 2021
[11]

Paint by example: Exemplar-based image editing with diffusion models

Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023

work page 2023
[12]

Plug-and-play diffusion features for text- driven image-to-image translation

Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text- driven image-to-image translation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023

work page 1921
[13]

Anydoor: Zero-shot object-level image customization

Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6418–6427, 2024

work page 2024
[14]

Objectstitch: Object compositing with diffusion models

Yizhi Song, Zhifei Zhang, Zhe Lin, Scott Cohen, Brian Price, Jianming Zhang, Seung Yong Kim, and Daniel Ali. Objectstitch: Object compositing with diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18310–18319, 2023

work page 2023
[15]

Tf-icon: Diffusion-based training-free cross-domain image composition

Shilin Lu, Yanzhu Liu, and Hao-Wei Adams. Tf-icon: Diffusion-based training-free cross-domain image composition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2294–2305, 2023

work page 2023
[16]

Putting people in their place: Affordance-aware human insertion into scenes

Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A Efros, and Krishna Kumar Singh. Putting people in their place: Affordance-aware human insertion into scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17089– 17099, 2023

work page 2023
[17]

Teleportraits: Training-free people insertion into any scene

Jialu Gao, KJ Joseph, and Fernando De La Torre. Teleportraits: Training-free people insertion into any scene. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18866–18875, 2025

work page 2025
[18]

Insert anyone: High-fidelity full-body photo insertion via dual-branch adapters.Expert Systems with Applications, page 131013, 2026

Yifan Zhang, Jianguo Wang, Zhongliang Tang, and Wenmin Wang. Insert anyone: High-fidelity full-body photo insertion via dual-branch adapters.Expert Systems with Applications, page 131013, 2026. 10

work page 2026
[19]

Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dream- booth: Fine tuning text-to-image diffusion models for subject-driven generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023

work page 2023
[20]

An image is worth one word: Personalizing text-to-image generation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In International Conference on Learning Representations, 2023

work page 2023
[21]

Adding conditional control to text-to-image diffusion models

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023

work page 2023
[22]

T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohui Qie. T2i- adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024

work page 2024
[23]

Humansd: A large-scale dataset and baseline for human-centric text-to-image generation

Xuan Ju, Ailing Zeng, Chenjian Wang, Jianan Su, Jianing Wang, Yunsheng Li, Defeng Ding, Haiyong Zheng, Lu Qi, Anton van den Hengel, et al. Humansd: A large-scale dataset and baseline for human-centric text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22934–22945, 2023

work page 2023
[24]

Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models

Hu Ye, Jun Zhang, Sibei Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6527–6536, 2024

work page 2024
[25]

Instantid: Zero-shot identity- preserving generation in seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity- preserving generation in seconds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[26]

Photomaker: Customizing realistic human photos via stacked id embedding

Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8640–8650, 2024

work page 2024
[27]

Fastcomposer: Tuning- free multi-subject image generation with localized attention.arXiv preprint arXiv:2305.10431, 2023

Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning- free multi-subject image generation with localized attention.arXiv preprint arXiv:2305.10431, 2023

work page arXiv 2023
[28]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019
[29]

Vggface2: A dataset for recognising faces across pose and age

Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018

work page 2018
[30]

Deepfashion: Powering robust clothes recognition and retrieval with rich annotations

Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016

work page 2016
[31]

Crowdhuman: A benchmark for detecting human in a crowd,

Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd.arXiv preprint arXiv:1805.00123, 2018

work page arXiv 2018
[32]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023

work page 2023
[33]

Blended diffusion for text-driven editing of natural images

Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022

work page 2022
[34]

Diffedit: Diffusion-based semantic image editing with mask guidance

Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. InInternational Conference on Learning Representations, 2023

work page 2023
[35]

Magicanimate: Temporally consistent human image animation using diffusion model

Jianhan Xu, Ke Xiao, Yiran Zhao, et al. Magicanimate: Temporally consistent human image animation using diffusion model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22744–22753, 2024. 11

work page 2024
[36]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023

work page 1931
[37]

Celebv-hq: A large-scale video facial attributes dataset

Hao Zhu, Wu Wayne, Wentao Qiu, Chenxia Zhu, et al. Celebv-hq: A large-scale video facial attributes dataset. InEuropean conference on computer vision, pages 650–667. Springer, 2022

work page 2022
[38]

Stylegan-human: A data-centric odyssey of human generation

Jianglin Fu, Shikai Li, Yuming Jiang, Kwan-Yee Lin, Chen Qian, Chen-Change Loy, Wayne Wu, and Ziwei Liu. Stylegan-human: A data-centric odyssey of human generation. InEuropean Conference on Computer Vision, pages 1–19. Springer, 2022

work page 2022
[39]

Text2human: Text-driven controllable human image generation.ACM Transactions on Graphics (TOG), 41(4):1–11, 2022

Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, and Ziwei Liu. Text2human: Text-driven controllable human image generation.ACM Transactions on Graphics (TOG), 41(4):1–11, 2022

work page 2022
[40]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

work page 2014
[41]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

Kaiyi Huang, Peize Sun, Jianing Hou, et al. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2556–2566, 2023

work page 2023
[42]

Editbench: Image editing evaluation dataset

Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Peliti, Richard Baird, and David J Fleet. Editbench: Image editing evaluation dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14545–14554, 2023

work page 2023
[43]

YOLOv12: Attention-Centric Real-Time Object Detectors

Yunjie Tian, Qixiang Ye, and David Doermann. Yolov12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524, 2025

work page internal anchor Pith review arXiv 2025
[44]

Arcface: Additive angular margin loss for deep face recognition

Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

work page 2019
[45]

Attend-and-excite: Attention- based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023

Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention- based semantic guidance for text-to-image diffusion models.ACM transactions on Graphics (TOG), 42(4):1–10, 2023. 12 A Implementation details and Metrics Model Training.We useBDP-InsHumanobtained in Section 3.3 (consisting of 529 high-quality data pairs) t...

work page 2023