arxiv: 2401.07519 · v2 · pith:MFY5ZWIYnew · submitted 2024-01-15 · 💻 cs.CV · cs.AI

InstantID: Zero-shot Identity-Preserving Generation in Seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, Yao Hu

Pith reviewed 2026-05-17 20:57 UTC · model grok-4.3

keywords identity preservation · personalized image synthesis · diffusion models · zero-shot generation · plug-and-play module · face fidelity · text-to-image generation

The pith

InstantID generates high-fidelity personalized images from one face photo in seconds without fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InstantID as a diffusion model solution for personalized image synthesis that uses only a single facial image as input. Earlier methods either required lengthy fine-tuning of many parameters or several reference photos, which limited their practical use. InstantID adds a plug-and-play IdentityNet that supplies strong semantic guidance from the face image and weak spatial guidance from landmarks, combined with text prompts, to steer generation inside existing models. This produces outputs in varied styles while keeping the subject's identity intact. If the approach holds, it would let users create identity-preserving images quickly with standard pre-trained models like SD1.5 and SDXL.

Core claim

The paper claims that InstantID achieves zero-shot identity-preserving generation through a novel IdentityNet that imposes strong semantic conditions from a single facial image and weak spatial conditions from landmark images, integrated with textual prompts to guide the diffusion process. This design delivers high face fidelity across styles, requires no fine-tuning, operates via a single forward inference, and serves as a compatible plugin for pre-trained models such as SD1.5 and SDXL.

What carries the argument

IdentityNet, which imposes strong semantic conditions from a facial image and weak spatial conditions from landmarks while integrating with textual prompts to steer diffusion-based generation.
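
As a rough illustration of what a "strong semantic condition" injected through cross-attention could look like, the toy sketch below adds a second attention branch over identity tokens to the usual text branch. It is not the paper's implementation: the function names, the additive fusion, and the `id_scale` weight are assumptions made for illustration, and learned projection matrices are omitted entirely.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over a list of tokens."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def fused_cross_attention(query, text_tokens, id_tokens, id_scale=1.0):
    """Text branch plus a separately weighted identity branch.

    The additive fusion and id_scale are hypothetical; tokens double as
    keys and values because projections are left out of this toy.
    """
    text_out = attend(query, text_tokens, text_tokens)
    id_out = attend(query, id_tokens, id_tokens)
    return [t + id_scale * i for t, i in zip(text_out, id_out)]

# Made-up 2-d tokens standing in for prompt and face embeddings.
query = [1.0, 0.0]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
id_tokens = [[0.5, 0.5]]

baseline = fused_cross_attention(query, text_tokens, id_tokens, id_scale=0.0)
conditioned = fused_cross_attention(query, text_tokens, id_tokens, id_scale=1.0)
print(baseline, conditioned)
```

With `id_scale=0.0` the output reduces to plain text cross-attention, which is one simple way to see how an add-on branch can leave the underlying model's behavior recoverable.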

If this is right

Personalized image synthesis works using only one reference facial image.
High face fidelity holds when generating outputs in different artistic styles.
No fine-tuning of the underlying diffusion model is required.
The module functions as a direct plugin for popular pre-trained models such as SD1.5 and SDXL.
Generation completes efficiently with a single forward inference pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-inference design could support generating multiple consistent images of the same person without repeated conditioning steps.
Extending the semantic-spatial split might help preserve identity when editing poses or expressions in the output.
Compatibility with existing models suggests the technique could be tested on related tasks such as preserving identity in short video clips.

Load-bearing premise

The assumption that strong semantic conditions from one face image plus weak spatial conditions from landmarks will maintain high face fidelity across styles without fine-tuning or additional references.

What would settle it

Generate images of a reference person in an unseen style or pose using only the single input image, then measure whether face recognition accuracy or human identification rates stay high compared to the original photo.
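
The measurement described above comes down to comparing identity embeddings of the reference photo with those of each generated image. The sketch below assumes the embeddings have already been extracted by some face recognition model (not shown); the vectors, the helper names, and the 0.5 threshold are placeholders, not values from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

def identity_retention(reference, generated, threshold=0.5):
    """Fraction of generated embeddings whose similarity to the reference
    meets the threshold. The threshold is an arbitrary placeholder."""
    if not generated:
        raise ValueError("need at least one generated embedding")
    hits = sum(1 for g in generated if cosine_similarity(reference, g) >= threshold)
    return hits / len(generated)

# Made-up 3-d embeddings standing in for real face recognition features.
reference = [1.0, 0.0, 0.0]
generated = [
    [0.9, 0.1, 0.0],  # close to the reference
    [0.0, 1.0, 0.0],  # orthogonal to the reference: identity lost
]
rate = identity_retention(reference, generated)
print(f"identity retained in {rate:.0%} of generated images")
```

Running the same check separately per style or pose would show whether retention drops in the unseen conditions the test is meant to probe.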

Original abstract

There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

InstantID adds a practical IdentityNet for single-shot face personalization on diffusion models, though experimental validation is key to its claims.

Hi, the main point on InstantID is that it introduces a plug-and-play IdentityNet to condition frozen diffusion models like SD1.5 and SDXL on a single face image plus landmarks. This setup uses strong semantic signals from the face and weaker spatial ones from landmarks, fused with text prompts, to generate styled outputs without any per-user fine-tuning or multiple references. That directly targets the storage and time costs of methods like DreamBooth or LoRA and the compatibility or fidelity issues in earlier zero-shot ID embedding approaches.

The architecture description and the plan to release code and checkpoints are clear practical steps forward. The paper lays out the motivation cleanly and shows how the new conditioning components fit into existing UNets via cross-attention injection.

On the soft spots, the abstract claims high fidelity across styles, but the visible text gives no metrics, ablations, or direct comparisons, so it is hard to judge whether the conditioning actually dominates when text prompts push hard on style. The stress-test worry about prompt semantics overriding identity features in non-photorealistic cases looks like a fair thing to check in the results.

This work is aimed at practitioners building creative tools who need fast, low-setup personalization rather than theorists advancing core diffusion math. A reader working on applied conditioning or deployment would find the design details useful. The concrete method, engagement with prior limitations, and promised reproducibility make it worth sending for peer review so the experiments can be properly evaluated.

Referee Report

2 major / 1 minor

Summary. The paper introduces InstantID, a plug-and-play module for zero-shot identity-preserving image generation with diffusion models. Using a single reference facial image, it proposes IdentityNet to apply strong semantic conditioning on the face image and weak spatial conditioning on landmarks, fused with text prompts inside a frozen SD1.5 or SDXL UNet. The central claim is that this approach achieves high-fidelity personalization across diverse styles without any fine-tuning or multiple references, while remaining compatible with community pre-trained models.

Significance. If the claims are substantiated, the work would offer a practical advance over fine-tuning-heavy methods (DreamBooth, LoRA) and prior ID-embedding approaches by enabling efficient, single-image, zero-shot personalization. The open release of code and pre-trained checkpoints is a clear strength that supports reproducibility and adoption.

major comments (2)

[Method] Method section (IdentityNet description): The design imposes strong semantic conditions from one face image plus weak landmark cues via cross-attention injection, yet provides no analysis, regularization term, or ablation demonstrating that identity features reliably dominate when text prompts introduce conflicting style semantics; this assumption is load-bearing for the zero-shot high-fidelity guarantee across styles.
[Experiments] Experiments / Results: The abstract and manuscript assert 'exceptional performance' and 'high fidelity' but contain no quantitative metrics (e.g., face similarity scores, CLIP-based identity preservation), ablation studies on conditioning strength, or direct comparisons against baselines such as IP-Adapter or other single-reference zero-shot methods, preventing verification of the central claims.

minor comments (1)

[Abstract] Abstract: The statement that the method 'proves highly beneficial' is imprecise; replace with concrete qualifiers tied to the reported efficiency or fidelity gains.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance. We address the major comments point by point below, agreeing that revisions are needed to provide more rigorous analysis and quantitative evidence.

Referee: [Method] Method section (IdentityNet description): The design imposes strong semantic conditions from one face image plus weak landmark cues via cross-attention injection, yet provides no analysis, regularization term, or ablation demonstrating that identity features reliably dominate when text prompts introduce conflicting style semantics; this assumption is load-bearing for the zero-shot high-fidelity guarantee across styles.

Authors: We thank the referee for highlighting this important aspect. The IdentityNet is intentionally designed to apply strong semantic conditioning on the reference face image to ensure identity dominance, with weak spatial conditioning on landmarks to allow flexibility. However, we recognize that the manuscript does not provide explicit ablations or regularization analysis for conflicting text prompts. In the revised manuscript, we will include new experiments demonstrating identity preservation under style conflicts, along with a discussion of the conditioning mechanism.

revision: yes

Referee: [Experiments] Experiments / Results: The abstract and manuscript assert 'exceptional performance' and 'high fidelity' but contain no quantitative metrics (e.g., face similarity scores, CLIP-based identity preservation), ablation studies on conditioning strength, or direct comparisons against baselines such as IP-Adapter or other single-reference zero-shot methods, preventing verification of the central claims.

Authors: We agree that quantitative metrics would better substantiate our claims of exceptional performance. The current manuscript emphasizes qualitative results to showcase the zero-shot and plug-and-play nature. We will revise the experiments section to include quantitative evaluations, such as face similarity scores using established face recognition models, CLIP-based metrics for identity and style, ablation studies on conditioning strengths, and comparisons with IP-Adapter and similar single-reference methods.

revision: yes
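
One way the ablation over conditioning strength discussed in this exchange could be tabulated is sketched below. Everything here is hypothetical: the record layout, the function name, and above all the numbers, which are placeholders chosen to make the example run and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean

def summarize_ablation(records):
    """Group (strength, id_similarity, text_alignment) records by strength
    and return per-strength means, sorted by strength."""
    grouped = defaultdict(list)
    for strength, id_sim, text_align in records:
        grouped[strength].append((id_sim, text_align))
    summary = []
    for strength in sorted(grouped):
        id_sims, text_aligns = zip(*grouped[strength])
        summary.append((strength, mean(id_sims), mean(text_aligns)))
    return summary

# Placeholder numbers only, NOT measurements from the paper.
records = [
    (0.0, 0.10, 0.30),
    (0.0, 0.20, 0.32),
    (1.0, 0.70, 0.25),
    (1.0, 0.80, 0.27),
]
summary = summarize_ablation(records)
for strength, id_sim, text_align in summary:
    print(f"strength={strength:.1f}  id_sim={id_sim:.2f}  text_align={text_align:.2f}")
```

Reporting identity similarity and prompt alignment side by side for each strength makes any trade-off between the two visible, which is the question the first major comment raises.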

Circularity Check

0 steps flagged

No circularity: InstantID introduces novel IdentityNet conditioning without reducing claims to fitted inputs or self-citations

The paper's central contribution is the design of a new IdentityNet module that applies strong semantic conditioning from a single face image and weak spatial cues from landmarks, fused with text prompts inside a frozen diffusion UNet. This is presented as an architectural innovation and plug-and-play component rather than a mathematical derivation, fitted prediction, or result obtained by self-citation. No equations or steps in the described method reduce to prior outputs by construction; the approach relies on explicit new conditioning mechanisms whose performance is evaluated empirically against baselines. The derivation chain is therefore self-contained and independent of the circularity patterns listed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim depends on the effectiveness of the new IdentityNet module and the assumption that pre-trained diffusion models can be extended via plug-and-play conditioning without retraining.

axioms (1)

domain assumption: Pre-trained text-to-image diffusion models such as SD1.5 and SDXL can be effectively steered for identity preservation by adding an IdentityNet without modifying base model parameters.
Invoked in the plug-and-play integration description.
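
The assumption can be pictured with a toy residual add-on: the base layer's weights are only read, and a separate module contributes a scaled correction. This is a generic sketch of the plug-in idea, not a description of how IdentityNet is wired; all names, weights, and the `scale` parameter are hypothetical.

```python
def base_layer(x, weights):
    """Stand-in for one frozen layer of a pre-trained model."""
    return [w * v for w, v in zip(weights, x)]

def plugin_residual(condition, plugin_weights):
    """Stand-in for the add-on module's contribution."""
    return [w * c for w, c in zip(plugin_weights, condition)]

def steered_layer(x, condition, base_weights, plugin_weights, scale=1.0):
    """Base output plus a scaled residual; base_weights are never written."""
    base_out = base_layer(x, base_weights)
    residual = plugin_residual(condition, plugin_weights)
    return [b + scale * r for b, r in zip(base_out, residual)]

base_weights = (2.0, -1.0)   # a tuple, so it cannot be modified in place
plugin_weights = [0.5, 0.5]  # the only values a training loop would update
x = [1.0, 3.0]
condition = [4.0, 2.0]

plain = base_layer(x, base_weights)
steered = steered_layer(x, condition, base_weights, plugin_weights)
unplugged = steered_layer(x, condition, base_weights, plugin_weights, scale=0.0)
print(plain, steered, unplugged)
```

Setting `scale=0.0` reproduces the base output exactly, which is the property that makes such a module removable without retraining the base model.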

invented entities (1)

IdentityNet · no independent evidence
purpose: To impose strong semantic and weak spatial conditions by integrating facial and landmark images with textual prompts.
New module proposed to address fidelity and compatibility issues in prior ID embedding methods.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV · 2026-05 · unverdicted · novelty 7.0
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Follow the Mean: Reference-Guided Flow Matching
cs.LG · 2026-05 · unverdicted · novelty 7.0
Flow matching admits controllable generation by shifting the conditional endpoint mean computed from a reference set, enabling training-free guidance on frozen pretrained models.

Follow the Mean: Reference-Guided Flow Matching
cs.LG · 2026-05 · unverdicted · novelty 7.0
Flow matching admits reference-guided control by shifting the conditional endpoint mean, enabling training-free steering of models like FLUX via example banks and a semi-parametric variant on DiT.

Adaptive Subspace Projection for Generative Personalization
cs.CV · 2026-05 · unverdicted · novelty 7.0
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.

StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition
cs.GR · 2026-04 · unverdicted · novelty 7.0
StyleID supplies human-perception-aligned benchmarks and fine-tuned encoders that improve facial identity recognition robustness across stylization types and strengths.

HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
cs.CV · 2026-04 · unverdicted · novelty 7.0
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.

VACE: All-in-One Video Creation and Editing
cs.CV · 2025-03 · unverdicted · novelty 7.0
VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
cs.CV · 2026-05 · conditional · novelty 6.0
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

L2P: Unlocking Latent Potential for Pixel Generation
cs.CV · 2026-05 · unverdicted · novelty 6.0
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection
cs.CV · 2026-05 · unverdicted · novelty 6.0
ODP-Net structurally disentangles universal forgery traces from generator fingerprints and semantics via orthogonal decomposition and purification, delivering state-of-the-art generalization to unseen AI image generat...

DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
cs.CV · 2026-04 · unverdicted · novelty 6.0
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
cs.CV · 2026-04 · unverdicted · novelty 6.0
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
cs.CV · 2025-12 · conditional · novelty 6.0
Scone unifies subject understanding and generation in a two-stage trained model to improve both composition and distinction in multi-subject image generation, outperforming prior open-source models on new benchmarks.

How Noise Benefits AI-generated Image Detection
cs.CV · 2025-11 · unverdicted · novelty 6.0
PiN-CLIP jointly trains a noise generator and detector under a variational positive-incentive principle to inject feature-space noise that suppresses shortcut directions and improves out-of-distribution accuracy by 5....

Adversarial Concept Distillation for One-Step Diffusion Personalization
cs.CV · 2025-10 · unverdicted · novelty 6.0
OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.

RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
cs.CV · 2026-05 · unverdicted · novelty 5.0
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...

When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation
cs.CV · 2026-05 · unverdicted · novelty 5.0
Frozen identity adapter from FLUX dev works on distilled schnell model, enabling 5.9x faster generation with better identity preservation in few steps.

PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards
cs.CV · 2025-12 · conditional · novelty 5.0
A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.

AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
cs.CV · 2026-04 · unverdicted · novelty 4.0
Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.
